Preface


This book is intended to serve as a comprehensive text for the use of 
students in tests and measurements courses, whether in the field of psy- 
chology or education. It can also serve as a helpful adjunct to graduate 
courses in measurement, and as a practical guide to psychological testing 
for teachers and administrators. For use in undergraduate courses, the 
material has been organized in such a way that some of the more difficult 
sections can easily be omitted if necessary. For example, if his students 
have had little or no training in statistics, the instructor may wish to omit 
all or parts of Chapters 6, 7, 8, and 9. 

Tests and Measurements: Assessment and Prediction is based on four 
principles. The first of these principles is that the student must begin by 
learning some statistical concepts and procedures before he can thor- 
oughly understand the measurement of human behavior. One of the major 
aims of this book is to make the statistical concepts relevant to such meas- 
urement both understandable and interesting. Although most students 
tend to be afraid of statistics and reluctant to learn about them, these 
negative attitudes are probably traceable more to poor methods of instruc- 
tion than to the intrinsic nature of the subject matter. All too often statis- 
tical formulas are presented to classes without adequate preparation or 
explanation. In Tests and Measurements, the author has undertaken to 
provide the reader with some insights into why statistical methods are 
necessary and how they work. 

The second principle upon which this book is based is that the reader 
should be given a broad acquaintance with the whole subject of meas- 
uring human behavior, not just with the measurement of individual dif- 
ferences. Therefore, measurement in various kinds of psychological 
research studies is discussed in Chapter 1; an introduction to psycho- 
physics is given in Chapter 2; Chapter 17 specifically concerns measure- 
ment techniques for studies of personality and social psychology; and the 
ethics of performing research studies are discussed in Chapter 18. 
Although the major emphasis of the book is on studies of individual 
differences, these studies are presented in a context of general psychology; 
and as much material as possible has been introduced on the
methods used in psychological experiments.

The third principle upon which this book is based is that the methodology
of measurement is the proper subject matter for a tests and measurements
course. Certainly a sound presentation of methodology is of more
value to the students than a detailed description of particular tests and 
measurement techniques. For one thing, there are far too many tests and 
measurement techniques available to permit the reader to gain an 
acquaintance with even a fraction of them. Then, too, detailed study of 
particular tests and measurement techniques is of rather limited value, 
since we now know how to construct better instruments than most of
those that are currently in use. 

The fourth principle is that controversies about human measurement 
should be met head on rather than avoided. Under the guise of eclecti- 
cism, textbooks often hide from the reader the fact that there are differing 
points of view on many issues. If a person does not become acquainted 
with the major controversies in a field and does not learn about the con- 
ceptual and mathematical tools that are used in these controversies, he 
has at best only a shallow acquaintance with the subject matter. Conse- 
quently, the author has chosen to emphasize critical analyses of tests and 
of general approaches to measurement. It is hoped that these critical 
analyses will acquaint the reader with the major controversies in human 
measurement, will sharpen his ability to criticize measurement methods, 
and will help him make the decisions that he will face when applying 
his learning in actual testing and measurement situations. 

The author is deeply indebted to those persons who read and criticized 
portions of the manuscript. Particular thanks are given to Doctors Lloyd 
Humphreys, Ted Husek, Charles Osgood, George Suci, and to Mr. How- 
ard Bobren. The book owes a great deal to the patience and fine secre- 
tarial skill of Mrs. Freda Schell. 


Jum C. Nunnally, Jr. 
CHAPTER 1


The Foundations of Psychological Measurement 


Psychological measurement is a close and important concern to all of 
us. The student who reads these pages for a course assignment is very 
much interested in one type of psychological measurement: the method 
of grading used by the instructor. The person who aspires to higher edu- 
cation needs the best possible measurement of his capabilities before 
wagering his future. A successful marriage depends on the personalities 
of the partners, and it would be worth a great deal to a young couple to 
have their compatibility measured in advance. Many millions of dollars 
are spent in measuring the public reaction to political candidates, new 
brands of soap, and television programs. In these and many other ways, 
psychological measurement is a part of everyday living. 

In spite of the many needs for psychological measurement methods, 
few had been developed up to the turn of this century. It is surprising 
that men learned so much of the world around them without learning 
more about themselves. A technique was developed for measuring the 
circumference of the earth two thousand years before psychological tests 
came into general use. There are still many weak spots in the storehouse 
of psychological measurement methods. Important decisions must often 
be made about people on the basis of very crude "yardsticks." The slow- 
ness with which psychological measurement has developed is due in part 
to the complexity of the individual human. Some mistaken ideas about 
the purpose and substance of psychology have also retarded the develop- 
ment of measurement methods. These will be discussed in the following 
sections.

The Nature of Psychological Data. The development of psychological
measurement and of psychology as a whole was delayed, in part, by
some misunderstanding about the nature of the subject matter. The 
philosopher Kant once said that it would not be possible to have a science 
of psychology because the basic data could not be observed and meas- 
ured. Modern psychologists would agree in part with Kant, agree to the 
extent that only certain kinds of psychological phenomena are open to 
observation and measurement. 

Psychological science concerns itself with the study of human behavior,
with the actions, judgments, words, and preferences of individuals. The 
study of such tangible behavior complies with the requirements of scien- 
tific research. However, there are other phenomena that are spoken of as 
being “psychological” which cannot be the substance of direct scientific 
study. Some of these are sensations, feelings, and images. 

Throughout the history of philosophical thought there has been a 
tendency to divide psychological phenomena into those that are "physi- 
cal" and those that are “mental.” This division of phenomena is referred 
to as psychophysical dualism, the belief in separate mental and physical 
processes. Because of this unfortunate logical standpoint scholars have 
been drawn into many needless arguments about the "connections" 
between the mental and the physical. 

By adopting a set of simple rules, the seemingly difficult problem of 
psychophysical dualism can be circumvented. The purpose of scientific 
effort is to test statements about the world of events—all those events that 
can be seen, heard, touched, or otherwise experienced in common. An 
illustrative scientific statement is, "Yellow fever is a virus infection." The 
phenomenon being studied might itself be impalpable, such as magnetism, 
atomic action, or the transfer of heat. But our knowledge of the phenom- 
enon must always come from publicly observable events: the changing 
direction of a compass needle, the action of a Geiger counter, or the reading
on a thermometer.

In order to accept a statement as either true or false, there must be a 
demonstration of evidence in the world of observable events. Only then 
can a statement be tested. The evidence must be such that it can be exam- 
ined by others. The procedures for obtaining the evidence should be 
sufficiently clear to allow independent investigators to gather new evi- 
dence in respect to the problem. These are some of the major rules that 
have been adopted by scientists, and the rules demonstrate their worth
by the progress that science has made during the last several hundred
years.

Instead of trying to pass judgment on the whole field of psychology as 
to whether it is “mental” or “physical,” it is more logical to examine par- 
ticular kinds of statements in psychology to see whether they meet the 
requirements of scientific inquiry. In this way we will find many state- 
ments that are testable, as well as many statements that we will want to 
avoid because they are untestable. 

Some testable and untestable statements in psychology can be illustrated
if the reader will imagine that he has in his hand a large, red,
juicy-looking apple. Take a bite of this make-believe apple. Does it taste
pleasant? If you had a friend there with you, would it taste “better” to him
than to you? How much more pleasant would be his sensation, twice as
much, or three times as much? This is an issue that cannot be studied
directly. There is no way to measure your sensations directly. They are 
uniquely yours in a way that exempts them from direct observation by 
others. Lift the apple in your hand. Do you have a feeling of “weighti- 
ness” in response to the apple? How “large” is your feeling? This is another 
phenomenon that cannot be observed by others or measured directly. 

There are some issues that could be studied in respect to the apple. 
Which would you rather have—an apple, an orange, or a pear? The way 
in which you respond to the question will demonstrate your preferences, 
and your preferences can be compared with those of other persons. How 
much do you think the apple weighs—7, 8, or 9 ounces? Your judgment 
can be compared with the weight of the apple, and a measure of your 
accuracy can be obtained. 

Your subjective experiences—your feelings, sensations, and desires— 
cannot be observed by others, and as such cannot be subjected to meas- 
urement. Once you do something in respect to your feelings—make a 
judgment, state a preference, or even talk to others about the experience— 
then these behaviors meet the requirements of scientific inquiry, and 
measurement becomes possible. 

It may seem like a trivial point to say that an individual's sensation of 
pleasantness is not open to scientific study but that his report of pleasant- 
ness is legitimate scientific data. However, failure to make this distinction 
forces the scientist into some knotty problems about the nature of his 
data. If it is claimed that the individual’s sensation of pleasantness is 
being measured, then one can always inquire as to the proof that the 
individual really enjoys the apple. The answer is, of course, that there is 
no way to prove that the individual enjoys the apple or, for that matter, 
to prove that there is any sensation at all related to the apple. We have 
only the individual’s word for liking or not liking the apple, and the 
verbal report, in this instance, is the basic datum with which the scientist 
must deal. 

Being restricted to what the individual does, or says, does not fore- 
shadow defeat in psychological work. There is much that can be learned 
about human behavior, and there are many ways in which one human 
action can be predicted from others. For example, the individual's stated 
preferences for fruits would likely be predictive of what he would pur- 
chase or what he would choose to eat. More importantly, the scores that 
a person makes on psychological tests can predict approximately how well 
he will perform in many real-life situations. 

Emotional Barriers to Psychological Measurement. Psychological measurement
efforts are often reacted to as being “nosey” activities: asking
personal questions, embarrassing people with their shortcomings, and 
experimenting in emotionally laden matters. It gives most of us a twinge 
of anxiety to see ourselves portrayed as a set of statistics or to have our
abilities compared with those of others. We like to think of ourselves as 
being unique in a way that exempts us from study. 

Psychology has been held in disfavor by persons who think of it as a
denial of “free will.” Their argument is that if a person's behavior is
predictable then the individual has no choice in his actions, no freedom of
will. Psychology starts with no a priori viewpoints about the nature of
man but looks instead for the regularities that occur in human behavior.
When such regularities are sought, they are found in abundance. For
example, a white object will nearly always be regarded as larger than a
black one of the same size. Persons who make low scores on "intelligence"
tests seldom manage to complete college training. If you are like most of
us, you are consistent to some degree about the kinds of clothes that you
buy, the types of people that you choose as friends, the place where you
like to sit in a theater, and many other things that can be measured and
studied.

Some people think of psychology as an attempt to intrude on religious
and ethical beliefs. Because of psychology's behavioristic approach, it
automatically prevents itself from investigating the individual's
experiential world. Psychology can study the differences in ethical and moral
practice among people but cannot reach decisions about which beliefs
are more real or more correct.


The Need for Measurement. Some form of measurement is essential to
all scientific studies. Without the precision of mathematical language and
the deductive possibilities which it allows, it would be very difficult to
express most statements in a way that would allow them to be tested.
Consider the hypothesis that tornadoes occur on days when the atmospheric
pressure is high and a sudden drop in temperature occurs. We
would first want to inquire what is “high” pressure, high in relation to
what, how much in an absolute sense. This part of the statement could be
rephrased in terms of barometer readings. The question of “how much”
would come again in respect to the phrase “sudden drop in temperature.”
How much of a drop is a sudden drop: 10 degrees over a period of [...]?
We would need the measurement techniques provided by barometers and
thermometers to make the original statement more specific. Note that a
hypothesis as simple as the fictitious one here requires several kinds of
measurements. The problem would be all but impossible to state precisely
without reference to measurement methods.


MEASUREMENT IN PSYCHOLOGY


There are a number of special fields in psychology. Some of these are
clinical, social, physiological, and differential psychology. Research in all
areas of psychology is dependent on adequate measurement methods. In
some areas, the measurement problems have proved more difficult than in 
others. For example, the clinical psychologist finds it difficult to measure 
things like the improvement of patients in psychotherapy and the severity 
of emotional disorders. One of the most difficult problems for psychologists
is to find ways to measure important human attributes.

Individual Differences and Psychological Processes. There are two 
basically different kinds of measurement problems encountered in psy- 
chology. These are the measurement of individual differences and the 
measurement of processes. The study of individual differences, differential 
psychology, is sufficiently broad to be considered a special field. As the 
name implies, differential psychology concerns the ways in which people 
differ from one another. People differ in terms of their walking speeds, 
ability to learn Latin, sensitivity to pain, susceptibility to hypnosis, and in 
countless other ways. Some of these individual differences are of little 
importance in our daily lives; others are of paramount importance for 
success and happiness. 

A psychological process is a series of responses by one person. The 
process of reacting to alcohol is a useful example. First there is a decrease 
in pulse rate, an increased feeling of pleasantness, an increase in reaction 
time, and finally, gross loss of coordination. The characteristic changes 
would be the same for all persons regardless of the initial levels of pulse 
rate, reaction time, and so on. 

Another example of the difference between processes and individual 
differences can be illustrated with a learning experiment. Any person 
who performs learning experiments with animals, say, with white rats, 
finds that some rats learn faster than others. Instead of being interested in 
individual differences among rats, the experimenter is usually more inter- 
ested in the learning process. That is, he would like to determine the 
regularity of improvement and the variables which affect improvement, 
regardless of individual differences. The family of learning curves in 
Figure 1-1 illustrates the problem. 

The curves in Figure 1-1 show the amounts of time taken by four rats 
to reach the goal in a maze on each of eleven trials. On the first trial it is 
apparent that there are individual differences among the rats in the time 
taken to reach the goal. Rat No. 1 took eighty seconds, and rat No. 4 took 
only fifty-five seconds. Also, we see that the order of the rats on the first 
trial is maintained in most of the other trials, showing that some of the 
rats are more capable in the particular type of learning problem. 

A second type of information to be seen in Figure 1-1, and generally
the most sought-after information in studies of this kind, is that the
process of learning tends to be the same for all four rats. Regardless of
individual differences at each trial, the rats as a group tend to learn in the
same pattern. This group tendency can be seen better by looking at
the curve of average times. This was obtained by adding together the 
times for the four rats at each trial and dividing by four. This curve shows 
that for the rats as a group the learning is much faster during the first 
several trials. The rate of learning for the group levels off after that, and 
only slight improvement for the group can be found after the eighth trial. 
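
The averaging just described can be sketched in a few lines of Python. This is only an illustration: apart from the first-trial times of eighty and fifty-five seconds for rats 1 and 4, every number below is invented, and the names `times`, `average_curve`, and `gains` are hypothetical.

```python
# Hypothetical times (seconds) for four rats over eleven trials.
# Only the first-trial values for rat 1 (80) and rat 4 (55) come from the text;
# every other number is made up for illustration.
times = {
    "rat 1": [80, 62, 50, 42, 36, 32, 29, 27, 26, 26, 25],
    "rat 2": [72, 56, 45, 38, 33, 29, 26, 25, 24, 23, 23],
    "rat 3": [64, 50, 40, 34, 29, 26, 24, 22, 22, 21, 21],
    "rat 4": [55, 43, 35, 29, 25, 22, 20, 19, 18, 18, 18],
}


def average_curve(times_by_rat):
    """Add the times of all rats at each trial and divide by the number of rats."""
    curves = list(times_by_rat.values())
    n_rats = len(curves)
    return [sum(trial) / n_rats for trial in zip(*curves)]


avg = average_curve(times)

# Improvement from one trial to the next (positive means faster than before).
gains = [earlier - later for earlier, later in zip(avg, avg[1:])]

for trial, mean_time in enumerate(avg, start=1):
    print(f"trial {trial:2d}: {mean_time:5.2f} s")
```

With data of this shape the entries in `gains` shrink from trial to trial, which is the leveling-off of the group curve that the text describes.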


Figure 1-1. Hypothetical curves showing the amounts of time taken by four rats to
reach the goal in a learning maze.

In studies of individual differences, the experimenter usually hopes that
differences among people will be as large as possible. If such differences
are absent, there is nothing with respect to individual differences to study.
In studies of processes, the experimenter usually hopes that individual
differences will be as small as possible, because they only serve to
complicate the regularity that he is seeking to measure.

Measurement in Differential Psychology. Measurement is particularly
essential to differential psychology. Differential psychology has two
parts: the first concerns the tools for
measuring individual differences; the second concerns the facts that have
been established by applying the tools.

The remainder of the book will be more concerned with the subject 
matter of differential psychology than any other special area, and the 
emphasis will be on the measurement of individual differences. Attention 
will be given to some of the problems in the measurement of processes. 

Although the major emphasis in this book is on the study of individual 
differences, we should not overemphasize the conclusions that can be 
obtained from studies of this kind. There are limits to what can be learned 
about human behavior from individual differences alone. Although studies 
of individual differences have proved very useful, it should be remem- 
bered that it is the study of processes with which psychology is more 
extensively concerned. 

The study of individual differences has a wide appeal in the United 
States. Not only has much of the work in this field of study been done by 
persons in our country, but the subject of individual differences has a 
particular appeal for the American public. There is perhaps no other 
country in which a person can attract so much attention by being differ- 
ent or doing something unusual. There are contests to see who can eat the 
most pies, sit longest on a flag pole, and dance longest without falling 
from exhaustion. 

One of the major reasons we are interested in differences among people 
is that this is a "land of opportunity," as the well-worn phrase puts it. 
There is more opportunity for an individual to use his ability to attain 
fame and fortune than in countries where either the class structure or the 
lower standard of living provides less room for the individual to get 
ahead. If a person is better looking, can sing better, is more intelligent, is 
a better athlete, he may be on the road to great personal gain. Consequently,
people are usually eager to find some way in which they surpass
others, even if it is in so remote a behavior as being able to hold their
breath longer.

As evidence of our interest in individual differences, American magazines
are replete with short, and usually inadequate, tests of vocabulary,
popularity, marriage happiness, and many others. The mother is eager to
see that little Johnny learns to walk before the neighbor’s child. She wants
to have his IQ measured, and hopes that he is a budding genius.

The study of individual differences has been given considerable atten- 
tion during the last fifty years. One reason for the interest of psychologists 
in this type of work is the success that the testing movement has encountered.
Some areas of psychology can be more properly regarded as sciences
of the future. The groundwork is being laid, but the present storehouse
of facts is not large. There is also a great deal yet to be learned
about the measurement of individual differences. The measurement of
personality characteristics has proven a particularly onerous stumbling
block. Even considering the uncharted regions in the measurement of
human behavior, the testing movement in America has reached an advanced
stage of technical sophistication and proved practical importance.


MEASUREMENT SCALES 


In the discussion to follow, we will speak of a "scale" as denoting the 
procedure, or “yardstick,” with which a particular measure is obtained. 
Measurement scales are part of everyday living: the ruler as a scale to 
determine length, the thermometer to indicate temperature, and the 
achievement test to measure success in school. We customarily place 
numbers along scales, such as inches, degrees Fahrenheit, and test scores. 
However, these are not the same kind of numbers; they mean different 
things about "how much," and they have different mathematical uses. 
Some of the major kinds of measurement scales will be discussed in the
following sections.

Labels. Labels, or tags, are used to signify that one object or person is
distinct from another. The numbers on the backs of baseball players are
labels. They tell nothing at all about quantities. The player numbered 44
is not necessarily twice as good as 22. If we add the labels for these two
players, 66 is obtained. Does this sum have any meaning? Any other numbers
would have served as well, or the players could be distinguished by
different colors, alphabetical letters, or geometric designs. Labels have no
mathematical properties and are not scales at all. Other examples of numbers
used as labels are the numbers on theater seats, numbers used to
designate highways, and the numbering of stamps in a collection. Not
every set of numbers is necessarily a set of measures. The investigator
must ensure that the numbers which he studies are really quantities rather
than mere labels.

Categories. Categorization, or qualitative distinction, is the most
elementary type of scientific datum. The biologist makes a distinction between
insects that have wings and those that do not. The chemist classifies
liquids into those that are acid and those that are alkaline. Mental patients
are categorized into groups such as depressive and schizophrenic. Members
of the general public are categorized in terms of their preferences for
political candidates.

Categorization implies that the things in a category have something in
common and that they differ in some characteristic from the objects in
other categories. The difference between labels and categories can be
illustrated with a geologist’s collection of rocks. The geologist might first
apply labels to his collection, numbering them from 1 to N, simply as a way
of keeping track of all the
specimens. Later he might categorize the rocks as to whether they are
sedimentary or igneous in origin. 

It was said previously that no mathematical significance could be 
placed on labels. Although categorization is a humble beginning, there are complex
mathematical systems that work entirely with categorical data. 

The number of categories used in particular problems may be only two, 
or, in some studies, the number of categories is large. An example of 
multiple categorization would be the description of people in terms of 
occupations: farmers, plumbers, lawyers, and so on. 

When using categories, no distinction is made among the objects or 
persons within a category. We might think of the members of any cate- 
gory as being equated to the number one—one plumber, one schizo- 
phrenic, or one alkaline liquid. This would then give no information as to 
whether one plumber is better than another or whether one liquid is more
alkaline than another. Also, from the categorizations alone, there would 
be no way of knowing that lawyers receive higher incomes than farmers 
and plumbers, or that any relationship exists among the categories. 

The results of psychological studies are sometimes reported in terms of 
categories. There are a number of ways of statistically analyzing data of 
this kind, some of which will be mentioned in the chapters ahead. Chief 
among these procedures is the description of populations of people in 
terms of the frequencies that fall in various categories and the comparison 
of categories for the amount of overlap. Illustrating the former, mental 
hospital patients can be characterized as schizophrenic, paranoid, manic- 
depressive, or other. The latter type of procedure might be used to show 
that the members of one category of mental patients more frequently 
come from home environments of a particular kind than do the members 
of other categories of mental patients. 
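
The two procedures just named, counting frequencies within categories and comparing categories for overlap, can be sketched as follows. The diagnostic category names come from the text; the individual cases, the home-environment labels "A" and "B", and the helper `proportion_from_home` are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (diagnostic category, type of home environment).
# The cases and the home-environment labels "A" and "B" are made up.
patients = [
    ("schizophrenic", "A"), ("schizophrenic", "A"), ("schizophrenic", "B"),
    ("paranoid", "B"), ("paranoid", "B"),
    ("manic-depressive", "A"), ("manic-depressive", "B"),
    ("other", "B"),
]

# Frequencies: how many persons fall in each diagnostic category.
frequencies = Counter(diagnosis for diagnosis, _ in patients)

# Overlap: how often each diagnostic category occurs with each home type.
overlap = Counter(patients)


def proportion_from_home(diagnosis, home):
    """Share of the patients in one diagnostic category who come from one home type."""
    return overlap[(diagnosis, home)] / frequencies[diagnosis]


for diagnosis in frequencies:
    share = proportion_from_home(diagnosis, "A")
    print(f"{diagnosis}: n = {frequencies[diagnosis]}, from home type A = {share:.2f}")
```

Note that nothing here ranks or scales the categories. Only counts and proportions are computed, which is all that categorical data permit.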

Ordinal Scales. The ordinal scale relates persons or objects to one 
another by ordering or ranking them in respect to an attribute. Course 
marks are sometimes reported as ranks. The person who performs best is 
given a rank of 1, second-best receives a rank of 2, and so on to the person 
with the worst performance. As a convention the ordering is symbolized
as 1, 2, 3, . . . , N, with the understanding that the numbers mean first,
second, third, . . . , Nth.

The essence of the ordinal scale is the concept of “greater than” por- 
trayed by the symbol (>). Thus a > b means that a is greater than b, or 
that b is less than a. To have an ordinal scale, it must be established that 
a > b > c > d > e > . . . > N for N number of persons in respect to
any particular attribute. The concept of "greater than" characterizing 
ordinal scales is a kind of information in addition to the concept of “dif- 
ferent from" used with categories. 

An ordering, or rank-ordering, constitutes a scale of measurement, but
the use of such scales has definite mathematical limitations. The ordinary 
arithmetical operations of addition, subtraction, multiplication, and divi- 
sion make little sense with ordinal numbers. It is not sensible to add 
together the first and third man and equate this in any way with the 
fourth man, as the ordinary operations of arithmetic would lead us to 
expect. 

Two important kinds of information are lacking in the ordinal scale. 
The first is that no information is provided as to how well the group per- 
forms as a whole. If there are six students and no ties in examination 
grades, there will be ranks 1, 2, . . . , 6 regardless of how well or how
poorly the whole group performs. The ranks might have been determined
within a group of geniuses, or the ranks might as easily have been
obtained within a group of only moderately capable students. 

The second important information that is lacking in the ordinal scale is 
that of the dispersion of performance. That is, there is no way of telling 
from the ranks alone how closely the second man approaches the first 
man, or how much better the first man performs than the man who is 
ranked last. If there are two classes and thirty students in each class,
there will be ranks 1 through 30 in each class. There is no way of telling
from the ranks whether the dispersion of ability is larger in one class than
in the other. The students in one class might have performed much the
same, and the students in the other class might have varied widely from 
one another in performance. 
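
A small sketch may make the two limitations concrete. The scores for the two classes below are invented, and the helper `ranks` is a hypothetical function that assumes there are no ties.

```python
def ranks(scores):
    """Rank scores so that the highest score gets rank 1; assumes no ties."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(score) + 1 for score in scores]


# Hypothetical examination scores for two classes of six students each.
class_a = [95, 94, 93, 92, 91, 90]   # high level, small spread
class_b = [88, 75, 61, 47, 33, 20]   # lower level, wide spread

ranks_a = ranks(class_a)
ranks_b = ranks(class_b)

print(ranks_a)
print(ranks_b)
print(max(class_a) - min(class_a), max(class_b) - min(class_b))
```

Both classes yield the same set of ranks, although they differ in over-all level and in spread. Neither kind of information survives the conversion to ranks.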

The most important type of mathematical operation that can be applied 
to ranks is that of correlation. In this way it can be determined to what
extent persons are ordered alike in two circumstances. It could be determined
whether the persons who make grades near the top in history also
make high grades in mathematics; additionally, an over-all numerical
index of agreement could be found over the extent of both orderings. The
ability to correlate two sets of measures satisfies the most basic require- 
ment for evaluating tests. Correlational analysis will be discussed in 
detail in Chapter 5. 
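
As a preview, one such index of agreement can be sketched here. The text does not say at this point which index it has in mind; Spearman's rank-order coefficient, rho = 1 - 6(sum of squared rank differences) / (N(N^2 - 1)), is used below as one common choice for untied ranks. The ranks themselves are invented.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman rank-order correlation for two untied rankings of the same persons."""
    n = len(ranks_x)
    if n != len(ranks_y) or n < 2:
        raise ValueError("need two rankings of equal length (at least 2 persons)")
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks of six students in history and in mathematics.
history = [1, 2, 3, 4, 5, 6]
mathematics = [2, 1, 3, 4, 6, 5]

rho = spearman_rho(history, mathematics)
print(round(rho, 3))
```

The index is 1 when the two orderings agree exactly and -1 when one ordering is the reverse of the other. The example falls close to 1 because only neighboring students trade places.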

Interval Scales. If, in addition to an ordering of persons, a knowledge 
is obtained of the interval or the distance between persons, then an inter- 
val scale is produced. The results of a foot race could be reported in the 
form of an interval scale. The customary procedure is, of course, to state 
the exact running time for each participant. To illustrate the characteristics
of interval scales, assume that the results are reported in terms of the
interval between the first and second man, the second and third man, and
so on. It could be said that the second runner came in one second behind
the first, the third runner came in two seconds behind the second man,
the fourth came in one second behind the third, and so on for all intervals.

Like ordinal scales, interval scales also provide no information about
how well the people perform as a group. Because the results of the race 
were reported only in terms of intervals, we have no information about 
the absolute running times for the men. If we learn later that the first 
runner took ten seconds, then we know that the second man took exactly 
eleven seconds, the third man thirteen seconds, and so on for the other 
runners; but if the first man had taken twenty seconds, then we learn that 
the second man took twenty-one seconds, indicating that the group as a 


whole is much slower. 

The important advantage of the interval scale over the ordinal scale is 
that knowing the interval allows for the determination of the dispersion,
or spread, of the scores. The range of scores is the interval, or distance, 
between the person who stands highest and the person who stands lowest 
on the scale. Also, the standard deviation of a set of scores can be deter- 
mined without a knowledge of the absolute level of measurements. 
(Measures of dispersion will be discussed in Chapter 3.) 
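
The point that dispersion does not depend on the absolute level can be checked with the foot race described above. The sketch below rebuilds finishing times from the intervals given in the text (one, two, and one second) under two different assumed times for the winner, and shows that the range and the standard deviation come out the same. The helper name `finishing_times` is only illustrative.

```python
from statistics import pstdev


def finishing_times(first_time, intervals):
    """Rebuild absolute times from the winner's time and successive gaps."""
    times = [first_time]
    for gap in intervals:
        times.append(times[-1] + gap)
    return times


intervals = [1, 2, 1]                  # gaps between successive runners, in seconds
fast = finishing_times(10, intervals)  # winner assumed to take ten seconds
slow = finishing_times(20, intervals)  # winner assumed to take twenty seconds

for times in (fast, slow):
    spread = max(times) - min(times)
    print(times, "range:", spread, "SD:", round(pstdev(times), 3))
```

Both groups show the same range and the same standard deviation, even though one group is ten seconds slower throughout.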

The arithmetical operations of addition, subtraction, multiplication, 
and division can be used with interval scales only in respect to the differ- 
ences between scores. For example, the interval between the men who 
finish the race first and second could be compared directly with the inter- 
val between those who finish third and fourth. 

The ordinary Fahrenheit thermometer is an example of an interval
scale. The zero point on the scale is arbitrary. It is not meaningful, for
example, to say that 90 degrees is twice as warm as 45 degrees.
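
One way to see why the ratio is not meaningful is to re-express the same temperatures on another scale with a different arbitrary zero. The sketch below uses the Celsius scale for this purpose, which is an illustration added here and not part of the original discussion; the 60-degree reading is likewise a hypothetical third value. Ratios of readings change with the scale, while ratios of intervals do not.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a Fahrenheit reading to the Celsius scale."""
    return (deg_f - 32) * 5 / 9


hot_f, mild_f, cool_f = 90, 60, 45
hot_c, mild_c, cool_c = (fahrenheit_to_celsius(t) for t in (hot_f, mild_f, cool_f))

# Ratio of two readings: depends on where the zero point happens to be.
ratio_f = hot_f / cool_f
ratio_c = hot_c / cool_c

# Ratio of two intervals: the same on both scales.
interval_ratio_f = (hot_f - cool_f) / (mild_f - cool_f)
interval_ratio_c = (hot_c - cool_c) / (mild_c - cool_c)

print(ratio_f, round(ratio_c, 2))
print(interval_ratio_f, round(interval_ratio_c, 2))
```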

Ratio Scales. The final member in the hierarchy of measurement scales 
is the ratio scale. The ratio scale requires that the absolute measurement 
of each person be known. Then in the race that we used to illustrate 
interval scales, we might find that the first man took eleven seconds, the 
second man twelve seconds, the third man fourteen seconds, and so on 
for the remaining runners. Specifying the absolute level for running, or 
for any other attribute, is the same as saying that the zero point on the 
scale is known. 

In addition to its own special virtues, the ratio scale has all the advan- 
tages of the less powerful scales. All of the operations of arithmetic as 
well as the tools of higher mathematics can be applied to ratio scales. As 
the name implies, one score can be divided by another, multiplied by 
another, subtracted from and added to any other. 

Ratio scales are used quite frequently in everyday life. No simpler 
example is available than the weighing of two objects on a “scale.” Com- 
parisons of salaries, heights of individuals, numbers of rooms in buildings, 
and many, many other quantities are treated as ratio scales. It is so cus- 
tomary to deal with ratio scales that it is easy to make the mistake of 
assuming that all measures are of this kind. For example, it might be said
that John is “twice” as handsome as Bill. It is difficult in this case to see
how the ratio “twice” could be obtained, and it is reasonable to suspect 
that a ratio scale is being used where only an interval or even an ordinal 
scale is justified. 

Test Scores as Scales. Most test scores are treated as though they are 
interval scales. Without such an assumption there would be no way of 
obtaining measures of dispersion. As will be shown later, the standard 
deviation and its companion statistic, the variance, are crucial to the 
gathering of test norms and to the determination of how well tests work. 

It is not justifiable to treat most test scores as though they constitute
ratio scales. The absolute scores that people obtain are largely artifacts
of the ways in which tests are constructed. For example, a class instructor
might decide to give five points for each correct answer, or he might 
decide on ten points instead. Such decisions markedly affect the absolute 
size of scores. For example, suppose that persons a, b, and c make scores 
of 20, 15, and 12 on a test. Then, if we assume that a ratio scale is logical 
in this case, it could be said that a demonstrates 1.33 times as much of 
the particular ability as does b. Suppose that the test were revised by 
adding five items which are so easy that a, b, and c get them all correct. 
Now, the scores for a, b, and c would be 25, 20, and 17 respectively. This 
changes the ratio between the score of a and b to 1.25. We see in this way 
that ratios computed from test scores are not usually meaningful. 
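
The arithmetic of this example can be sketched in a few lines of Python. The scores are those given above; the function names are illustrative only. The sketch also shows the contrast with differences between scores, which are unchanged by the five added items.

```python
def score_ratio(scores, first, second):
    """Ratio of one person's score to another's."""
    return scores[first] / scores[second]


def score_difference(scores, first, second):
    """Difference between two persons' scores."""
    return scores[first] - scores[second]


original = {"a": 20, "b": 15, "c": 12}
# Five very easy items are added, and everyone answers them correctly.
revised = {person: score + 5 for person, score in original.items()}

print(round(score_ratio(original, "a", "b"), 2), score_ratio(revised, "a", "b"))
print(score_difference(original, "a", "b"), score_difference(revised, "a", "b"))
```

The ratio between a and b drops from about 1.33 to 1.25, whereas the difference between them stays at five points.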

In some instances, psychological measures are justifiably treated as 
ratio scales. Some of these would be the number of trials needed to learn 
a particular task, the length of time taken to respond to a visual signal, 
and the amount of perceptual distortion induced by a visual illusion. 
There have been some interesting attempts to deduce ratio scales for 
certain types of psychological measures, particularly for measures of atti- 
tudes and for measures of human judgment. 


SUMMARY 


Whether or not it is recognized and designated as such, psychological
measurement is an important ingredient of everyday life. Systematic
measurement methods were rather late in coming, and only during the
last one hundred years has the problem been carefully studied. A major
drawback to the development of measurement methods has been the
failure to distinguish the kinds of psychological phenomena that can and
cannot be measured. Psychological science is concerned with human
behavior, with the actions, words, judgments, and preferences of people,
all of which are open to measurement. Psychological science is not
concerned with purely subjective phenomena; until the individual does
something about his feelings, there is nothing to measure.

We are so accustomed to measuring physical objects and assigning
numbers to them on the basis of ratio scales that it is easy to assume that 
all measurements are of that kind. However, many measurements, and 
particularly those in psychology, must be made on a cruder basis. Conse- 
quently, it is important to specify the type of measurement scale which 
is in use. This will indicate the kinds of mathematical procedures that 
can be legitimately employed. 

Although we should be impressed with the need for measurement 
methods, this should not dim the importance of simple human observation 
and thought in the search for scientific lawfulness. Measurements are 
helpful to the scientist in explaining and exploring theories, but only the 
human observer can invent theories. No amount of elaborate measurement
can make up for a lack of ideas on the part of the experimenter.


CHAPTER 2


The Growth of Measurement Methods in Psychology 


People have always been interested in the measurement of human
attributes, and embryonic studies can be traced even to ancient China.
However, it was only one hundred years ago that the first systematic
attacks were made on problems of psychological measurement. During
the nineteenth century, the field of psychological measurement was
nourished by two major influences. First, psychological measurement
borrowed heavily from the concepts and tools that had been applied so
successfully in physics, chemistry, and astronomy. The nineteenth century
was a period of great progress in the physical sciences, and it was
reasoned by men of that day that the methods of physical science could be
applied to the human mind. It was during this period that psychophysics
was born: the precise and quantitative study of how human judgments
are made. Psychophysics has a proud tradition, and it has had a wholesome
influence on measurement methods in all areas of psychology. Because of
the importance of psychophysics in its own right and because of the
impact that it has had on psychological testing, the development of
psychophysics and some of the measurement methods will be discussed in
this chapter.

The second major influence on the development of psychological
measurement methods was the clinical tradition growing out of medicine,
psychiatry, and social welfare research. In these efforts there was an
urgent need for methods of measuring emotional stability and intelligence.
The first practical mental tests grew out of this tradition and were
applied to the problem of classifying mental deficients in public schools.

It has been only during the last several decades that these two
traditions have merged. The union of scientific technology and mathematics
with the clinician's intuitive approach and his concern for people
produced the modern methodology of testing.

In addition to the two major influences described above, a number of
other historical trends had a prominent influence on the development
of psychological measurement methods. There are many needs for
psychological tests in education, and measurement methods have grown at
the same rate that educational facilities have grown. Because of the many
direct uses for statistical methods in the development and use of psycho- 
logical tests, the early developments in the discipline of statistics were 
prominent influences on the growth of psychological measurement. 

The theory of evolution had a very important influence on psychology 
as a whole and particularly on the field of psychological measurement. 
Although evolution developed in biology and had its most important 
influence there, it helped foster new concepts of human behavior and 
interested psychologists in measuring human adjustment and the inheri- 
tance of psychological traits. 

More recently, the two great wars of this century had very important 
impacts on psychological measurement. In both conflicts psychological 
tests were needed for the selection and classification of armed forces per- 
sonnel. Never before had psychological tests been used in such wholesale 
quantities, and they proved their worth so well that psychological testing 
received wide public acceptance. 

Each of the major trends mentioned above will be discussed in some
detail in this chapter. Only by understanding these historical roots can
the modern methodology of psychological measurement be seen in its
proper perspective.


PSYCHOPHYSICS


The Personal Equation. Considerable progress was being made during
the eighteenth century in the development of scientific instruments:
chronographs, telescopes, compasses, microscopes, and many others.
Although scant attention was given to the place of the human observer in
the use of these instruments, a fortunate mishap stirred interest in the
problem. At the observatory at Greenwich, England, in 1796, an
astronomer’s assistant, named Kinnebrook, found himself consistently at
odds with his superior about astronomical observations. It was
Kinnebrook's job to observe the time of transit for certain stars (this
consisted, essentially, of noting the time that a star passed one of the
cross wires on a telescope). It was found that Kinnebrook differed with
his superior on the average by more than one-half second. Although such
an amount of time might seem trivial for practical purposes, it was
sufficient to introduce serious error into astronomical calculations. The
astronomer encouraged Kinnebrook to be more accurate in his observations;
but in spite of his efforts, Kinnebrook retained his characteristic
difference in observation time (see 2, chap. 8).

Astronomers as well as other scientists became interested in
Kinnebrook’s difficulties. Astronomers began to compare their measurements
with one another and found that there were consistent differences among
them. It became necessary then to determine an individual's “personal
equation,” his characteristic tendency to over- or underestimate
observations by a certain amount. This brought a recognition that people
differ in their judgments, and that such individual differences can be
measured and accounted for in scientific work.

The Limen of Sensation. Soon after interest had been focused on the
personal equation, philosophers and the early psychologists began to
speculate about the threshold of awareness, the limen. The limen is the
point at which a visual object, or stimulus of any kind, comes into
awareness. If you sit at late afternoon looking at the sky and wait long
enough, stars will become visible. One minute you do not see them and the
next minute you do. It is interesting to speculate about the precise
moment at which the stars come into awareness, the absolute limen. Another
example of the absolute limen can be made with the use of a watch. If you
hold a watch at arm’s length, you will not be able to hear the “ticking”
(unless you have an unusually noisy watch). As you draw the watch closer,
there is a point at which you can hear the ticking, which is the absolute
limen.

In addition to the absolute limen, there is a difference limen, the
difference in stimulation which makes the individual aware of a
difference. The difference limen can be illustrated with a hypothetical
experiment. Assume that you have two light bulbs, which we will refer to
as S_t and S_1. The brightness of S_t is fixed at a particular level and
is not changed throughout the experiment. Let the current be adjustable on
S_1 so that the intensity of light can be varied, and then make S_1 the
same brightness as S_t. If you increase the intensity of S_1, making it
gradually brighter, there will be a point at which the subject will become
aware of the difference between S_t and S_1. The difference in intensity
of the two lights at that point is the difference limen, or as it is
sometimes referred to, the JND, the just noticeable difference. Another
kind of difference limen can be obtained by making S_t and S_1 very
different in intensity at the beginning of the experiment. Then S_1 can be
made gradually dimmer until the subject can no longer see a difference
between S_t and S_1. When the difference limen is obtained in this way, it
is referred to as the JNND, the just not noticeable difference. The JND
and the JNND are usually not quite the same in experimental findings;
therefore, the experimenter often averages these to obtain a better
estimate of the difference limen.

During the first half of the nineteenth century, considerable study was
made of the limen: studies of the sensitivity of touch, hearing, vision, and
other sense modalities. One of the prominent persons in this work was
Ernst Weber, a German physiologist. One of Weber's principal findings
concerns the size of the difference limen at various levels of stimulation.
If in the judgment of weights, a weight of 6 pounds is just noticeably
different from a weight of 5 pounds, would a weight of 11 pounds be just
noticeably different from a weight of 10 pounds? To carry the question
further, would a weight of 101 pounds be “just noticeably different” from
a weight of 100 pounds? Your everyday experience tells you that this is
not the case, that the difference limen is dependent on the level of
intensity of S_t.

Weber found that the JND tends to be a constant ratio of the
intensity of S_t. If, in terms of the example above, a weight of 6 pounds is
just noticeably different from 5 pounds (the JND is 1 pound at this level
of intensity for S_t), the ratio would be 1:5. Then we would predict that
the JND for an S_t of 10 pounds would be 2 pounds (a weight of 12 pounds
would be just noticeably different). The JND can then be expressed in
the form of an equation as:

JND = cS_t

where c is the constant ratio. The above equation, which is referred to as
Weber's law, is approximately true for many different situations involving
judgments. It has even been applied analogously to such situations as the
willingness of an individual to invest money. For example, an individual
might be willing to invest $100 in a radio for a $3000 car but would not
be willing to spend that much on a radio for a $2000 car; the willingness
to invest more money tends to be a function of the amount of money
already invested.
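
The lifted-weight example can be worked through as a short Python sketch. It assumes, as the example does, that a 6-pound weight is just noticeably different from a 5-pound weight; the function names are illustrative only.

```python
def weber_constant(standard, just_noticeable):
    """Estimate c from one standard and the weight just noticeably heavier."""
    return (just_noticeable - standard) / standard


def predicted_jnd(standard, c):
    """Weber's law: the JND is a constant fraction of the standard."""
    return c * standard


c = weber_constant(5, 6)  # one pound in five
for standard in (5, 10, 100):
    jnd = predicted_jnd(standard, c)
    print(standard, "pounds: JND about", round(jnd, 2), "pounds")
```

On these assumptions the predicted JND at 100 pounds is about 20 pounds, which is why a weight of 101 pounds would not be expected to feel different from one of 100 pounds.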

Gustav Fechner. In the hands of Gustav Fechner, the rising interest in
human judgment became the cornerstone for modern psychological
measurement. Fechner went to the University of Leipzig when he was
sixteen years old and remained there, studying, then teaching and doing
research until his death seventy years later. Fechner went through an
amazing gamut of careers. He took his university degree in medicine, but
his subsequent livelihood and reputation were based on work in physics.
He sought accomplishment as a philosopher; and, almost in spite of
himself, he became a great psychologist. Fechner was what we might refer
to today as an “odd ball.” His philosophical speculations concerned the
presence of consciousness in all things, even imputing consciousness to
plants and inanimate objects. He wrote numerous pamphlets under the pen
name of Dr. Mises. In a heraldic tone the pamphlets called on the sleeping
world to arise to the new philosophy.

Fechner tried to support his metaphysical beliefs with scientific
experiments. He seized on the studies of human judgment, and particularly
the JND, as the starting point for his work. His research was based on the
postulate that sensation cannot be measured directly but that it is
legitimate to ask an individual whether a sensation is present or not and
whether one sensation is more intense than another. This is still the
logical foundation of psychophysics. Fechner set out on a long program of
research on human judgment, and in the course of this, he invented the
major methods of measurement and many of the procedures of analysis
which are used today. He performed the now classical studies on lifted
weights, visual brightness, and the sense of touch. Although he never
proved his philosophical points, he demonstrated how the logic and
methods of science could be used in psychological measurement. We will
look at some of the methods that were either developed by Fechner or
were suggested by his work.


PSYCHOPHYSICAL MEASUREMENT METHODS 


The Method of Limits. The method of limits was used on the previous
pages to explain the difference limen. In using the method a test stimulus,
S_t, is compared to a variable stimulus, S_1. One standard procedure is to
place S_1 equal to S_t, then increase S_1 gradually until the subject
recognizes the difference. Another procedure is to make S_1 markedly more
intense than S_t and then gradually lower the intensity of S_1 until the
subject can no longer recognize a difference.
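
The two procedures can be sketched as a small simulation. Everything about the observer below is an assumption made for illustration: it is a hypothetical, perfectly consistent subject who reports a difference whenever the two stimuli differ by at least one-tenth of the test stimulus. A real subject would be far less regular. The sketch runs one ascending and one descending series and averages the two results, as the experimenter often does.

```python
def ascending_jnd(standard, step, notices, max_steps=10_000):
    """Raise the variable stimulus from equality until a difference is reported."""
    variable = standard
    for _ in range(max_steps):
        if notices(standard, variable):
            return variable - standard
        variable += step
    raise RuntimeError("observer never reported a difference")


def descending_jnnd(standard, start, step, notices):
    """Lower the variable stimulus until the difference is no longer reported."""
    variable = start
    while variable > standard and notices(standard, variable):
        variable -= step
    return variable - standard


def hypothetical_observer(standard, variable, c=0.1):
    """Assumed observer: reports a difference once it reaches c times the standard."""
    return abs(variable - standard) >= c * standard


jnd = ascending_jnd(100, 1, hypothetical_observer)
jnnd = descending_jnnd(100, 130, 1, hypothetical_observer)
difference_limen = (jnd + jnnd) / 2
print(jnd, jnnd, difference_limen)
```

Even with this idealized observer the two series give slightly different values, because the ascending series stops at the first step where the difference is reported and the descending series stops at the first step where it is not.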

The method of limits is useful in designing mechanical equipment 
which requires the operator to detect differences in stimulus intensity of 
any kind. A radar scope embodies this principle. The operator must 
detect differences in amount of light on the scope. The equipment should 
be designed in such a way that the operator can most easily detect distant 
objects, such as a formation of airplanes. 

The Method of Average Error. This psychophysical method differs from
the method of limits in that the subject, instead of the experimenter,
varies S_1 and the task is to equate two stimulus intensities rather than
to judge when a difference is present. One procedure is to place S_1 at a
markedly different intensity from S_t. The subject is asked to adjust S_1
until it appears to be the same as S_t. The classical use for the method
of average error is in the study of visual illusions. Figure 2-1 shows the
famous Müller-Lyer illusion.

Figure 2-1. An apparatus for measuring the effect of the Müller-Lyer illusion.

In Figure 2-1 the arrowhead in the middle, 2, is midway between 1
and 3; however, the distance between 1 and 2 appears greater than
that between 2 and 3. If 3 is attached to a cord and can be slid to the
right and left by a subject, the method of average error can be used to
determine the extent of the illusion. The arrowhead at 3 can be set far to
the right and the subject asked to adjust it to a point where the two
distances appear equal. The experimenter can then measure the difference
between the two lines to determine the effect of the illusion.

The method of average error is often used to test depth perception. In
this case two sticks are located in a frame placed about 20 feet from the
subject. One stick, S_t, cannot be moved by the subject. A pulley
arrangement permits the subject to move the other stick, S_1, forward and
backward. The task for the subject is to align the two sticks (place them
at equal distances from himself). A person with good depth perception
can align the two sticks within 1 or 2 inches, whereas some persons cannot
bring the sticks closer than a foot from one another.
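
As a rough sketch of how a series of such settings might be summarized, the fragment below averages the errors over several trials. The settings are hypothetical, and the distinction drawn between the average size of the error and the average signed error is an addition made here for illustration, not a procedure taken from the text.

```python
from statistics import mean


def average_error(settings, target):
    """Mean absolute distance between the subject's settings and the target."""
    return mean(abs(setting - target) for setting in settings)


def signed_error(settings, target):
    """Mean signed error; a value away from zero suggests a consistent bias."""
    return mean(setting - target for setting in settings)


# Hypothetical positions of the movable stick on four trials, in inches
# beyond (+) or short of (-) the fixed stick, which is taken as zero.
settings = [1.5, -0.5, 2.0, 1.0]
print(average_error(settings, 0), signed_error(settings, 0))
```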

The Constant Method. Instead of using a variable stimulus, S_1, to
compare with S_t, the constant method uses a fixed set of stimuli, S_1,
S_2, S_3, and so on, to compare in turn with one S_t. In laboratory work
the number of comparison stimuli ranges from four to more than eight. In
an experiment with lifted weights, six comparison stimuli could be used.
If S_t is a weight of 200 grams, three of the comparison stimuli would
weigh more than S_t and the other three would weigh less than S_t. The
subject would be presented one of the comparison stimuli and asked to
judge whether it is heavier or lighter than S_t. Next, the subject would
be given another of the six comparison stimuli and asked to judge whether
it is heavier or lighter than S_t. The subject lifts S_t with one hand
while holding the comparison stimulus in the other hand. Typically, the
comparison stimuli would be given to the subject in a random order, to
remove any unnecessary cues for judgment. Also, the subject would
alternately lift S_t with the right and left hand to remove this source of
bias. After the subject had made many such comparisons, perhaps several
hundred, the results could be pictured as in Figure 2-2.

Figure 2-2. Psychometric function for the application of the constant method to the study of lifted weights.

Figure 2-2 shows, as you would expect, that the per cent of judgments
“greater” grows as weights above 200 grams are considered, and the
percentage declines as weights less than 200 grams are considered. The
subject hardly ever says that a weight of 185 grams is greater than S_t
and almost always says that a weight of 215 grams is greater than S_t. A
curve drawn through these points on the graph shows the typical
psychophysical function.

The constant method can be used in a wide variety of psychophysical
studies. Not only is it a useful way of determining thresholds, but the
shape of the psychophysical function with different kinds of judgments
is of interest.
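
The tabulation behind a curve of this kind can be sketched as follows. The counts are hypothetical and are chosen only to resemble the pattern described for lifted weights. Locating the weight judged “greater” half of the time by straight-line interpolation is one simple convention assumed here; the text does not prescribe it.

```python
def percent_greater(counts):
    """Per cent of 'greater' judgments for each comparison weight."""
    return {weight: 100 * greater / total
            for weight, (greater, total) in sorted(counts.items())}


def fifty_percent_point(curve):
    """Interpolate the weight that would be judged 'greater' half of the time."""
    points = sorted(curve.items())
    for (w_low, p_low), (w_high, p_high) in zip(points, points[1:]):
        if p_low <= 50 <= p_high and p_high > p_low:
            return w_low + (50 - p_low) * (w_high - w_low) / (p_high - p_low)
    raise ValueError("the curve does not cross 50 per cent")


# Hypothetical results: weight in grams -> ('greater' judgments, total judgments).
counts = {185: (1, 20), 190: (4, 20), 195: (8, 20),
          205: (13, 20), 210: (17, 20), 215: (19, 20)}
curve = percent_greater(counts)
print(curve)
print(fifty_percent_point(curve))
```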


The Method of Rank-order. In this method a number of stimuli are
ordered in respect to an attribute. No one of the stimuli is specifically
designated as the S_t. A simple example is to give an individual a number
of objects and ask him to order them from heaviest to lightest. The
method of rank-order is most frequently used when there are no
measurable scales for the stimuli, as would be the case if an individual
were asked to rank-order a number of foods in terms of preference. There
would then be no intrinsic property of “likability” that could be measured,
and consequently, the problem would be to determine the preference
ordering for the one person. There would be no way of determining how
“accurate” the individual is, as would be possible if lifted weights or
visual intensities were being studied.

The method of rank-order is used in many studies of interests, attitudes,
and preferences. The subject is asked to rank-order a number of activities
in terms of his interests, a number of acquaintances in terms of
preference, or a number of articles in terms of what he would purchase. The
method is also used in tests of human ability, in which the individual is
asked to rank-order a number of alternative answers to a question. Also,
test scores are sometimes reported as ranks, the person making the highest
grade receiving a rank of 1, the second highest a rank of 2, and so on for
all the subjects.
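
Reporting test scores as ranks can be sketched in a few lines. The names and scores are hypothetical, and the sketch assumes that no two persons have the same score, since the text does not say how ties would be handled.

```python
def ranks_from_scores(scores):
    """Give rank 1 to the highest score; assumes there are no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {person: position for position, person in enumerate(ordered, start=1)}


# Hypothetical test scores for three students.
scores = {"Ann": 88, "Bob": 95, "Cal": 71}
ranks = ranks_from_scores(scores)
print(ranks)
```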

The Method of Pair Comparisons. Like the method of rank-order, the
method of pair comparisons is concerned with the relative ordering of the
members in a set of stimuli. It differs from rank-order in that the subject
is required to compare the stimuli two at a time, until each stimulus has
been compared with every other. If we were studying food preferences
and had as our “stimuli” apples, oranges, peaches, and a number of other
fruits, the subject could be asked, first, “Which do you prefer, apples or
oranges?” Next, the subject would be asked whether he prefers apples or
peaches, and so on until apples are compared with each of the other
fruits. There is a resemblance to the constant method in that apples are
compared to a number of other stimuli. However, in pair comparisons,
the subject is then asked to say whether he prefers oranges to peaches,
and he continues to compare oranges with each of the other fruits. In all,
the subject must make n(n - 1)/2 judgments, where n is the number of
stimuli being studied. If there are 10 fruits in the example, there will be
45 comparisons to be made.
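
The count of judgments can be checked by listing the pairs directly. In the sketch below the short list of fruits is hypothetical (pears are added only to make a fourth item), and the function name is illustrative.

```python
from itertools import combinations


def all_pairs(stimuli):
    """Every pairing of two different stimuli, each pair listed once."""
    return list(combinations(stimuli, 2))


fruits = ["apples", "oranges", "peaches", "pears"]
pairs = all_pairs(fruits)
n = len(fruits)
print(len(pairs), n * (n - 1) // 2)   # both counts agree

# With ten stimuli, as in the example, the count is 45.
print(len(all_pairs(range(10))))
```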

The method of pair comparisons is usually considered the most exact
psychophysical tool for many laboratory studies. It can be used to provide
much the same information that can be obtained by the other methods;
and, because of the many judgments which the subject must make in pair
comparisons, the results are usually more reliable. Also, the method gives
an indication of the definiteness with which the subject prefers some of
the stimuli more than others. If an individual were to make pair
comparisons among objects which he preferred equally or which he judged to
be nearly equal in terms of an attribute, each object would be chosen about
as many times as those with which it was compared. Such information
could not be obtained in the method of rank-order, where the subject is
forced to give a different rank to each stimulus. Due to the laboriousness
of the pair comparisons procedure, its use is restricted to situations in
which it is necessary to obtain rather exact information about a set of
preferences or judgments.

The Place of Psychophysics in Psychology. We have had only a brief
look at Fechner’s contributions and some of the psychophysical methods
which grew from his work. There are numerous other special methods,
variants of the basic procedures, which are useful in studying particular
problems. It is important to note that, as in nearly all [...], most
current psychophysical studies are not concerned with learning about
individual differences. [...] is to find laws about human judgment and
preference that apply to everybody. In so far as large differences among
people appear in psychophysical studies, they only serve to complicate
the general lawfulness that is being sought.

In [...] the rich tradition of [...] goes on in studies of human
judgment, the [...] used widely in physiological psychology, studies of
[...], and the psychology of learning. Our particular [...] psychophysics
on the study of individual differences. [...] the psychophysical methods
used now in studies of individual differences, but the firm logical
foundation of human [...] brought over into the development of
psychological tests. We will not go into
detail about psychophysics, per se, in the pages ahead (see 5 for a detailed
description of the psychophysical methods). An introduction to the sub- 
ject was given here to show the wide range of problems that require 
psychological measurement methods and to place in perspective the atten- 
tion which will be given to the measurement of individual differences in 
subsequent chapters. 


THE GROWTH OF STATISTICS 


The construction and use of psychological tests are intimately
connected with statistical concepts and statistical techniques. Therefore, it
would be helpful to discuss some of the foundations for statistical methods 
before they are applied to testing problems in subsequent chapters. 

The impetus for the development of statistical methods came initially 
not from the budding scientific disciplines but from the needs of gamblers. 
The gamblers needed some principles for the setting of betting odds in 
roulette, dice, and card games. Men such as Bernoulli, De Moivre, 
Laplace, and Gauss laid the foundation for our modern systems of
mathematical statistics during the time between the middle of the
seventeenth and the middle of the nineteenth century.

The central notion in the discipline of statistics is that of probability.
Previous to the development of statistics, scientific laws had been
expressed in a way that allowed no room for errors in prediction. The law
of falling bodies, the predictions of celestial movements, and the laws of
heat exchange were intended to predict events in nature exactly. However,
in actual experiments the results seldom come out exactly as predicted.
Such inconsistencies were explained in terms of the inability to
achieve purified laboratory conditions, which require complete vacuums,
noncompressible fluids, completely constant temperature, and other ideal
circumstances. Later it came to be realized that such ideal conditions
could never be obtained, and consequently, that the results of
experiments could be only approximately predicted. In some fields of
research we have seen a succession of laws, each of which came closer to
explaining a scientific problem. Scientists have come more and more to
look on all laws as approximations, approximations that fit the real world
so closely as to be “true” for all practical purposes.

The need for a consideration of probability is greater in dealing with
some phenomena than in dealing with others. Relatively [...] and crude
scientific principles will serve us in everyday life, in problems such as
constructing a canoe to hold a certain amount of weight, setting up a
system of gears to lift tree stumps, and determining the amount [...] of
probability. The individual flip of a coin or throw of dice will obey not
even an approximate lawfulness. With events like these, it can only be
said that one outcome is more
likely, or more probable, than another. Because of the randomness of the 
individual event, statistical laws are true only on the average. For exam- 
ple, if we flipped a coin 1,000 times, we would expect to obtain approxi- 
mately 500 “heads” (although a “biased” coin might fool us here). 
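
The sense in which such a statement is true only on the average can be made concrete with a short calculation. The sketch below assumes a fair coin and independent tosses, and counts outcomes exactly; the function names and the particular range of 450 to 550 heads are illustrative choices made here.

```python
from math import comb


def probability_of_heads(n_coins, n_heads):
    """Chance of exactly n_heads heads in n_coins tosses of a fair coin."""
    return comb(n_coins, n_heads) / 2 ** n_coins


def probability_between(n_coins, low, high):
    """Chance that the number of heads falls from low through high, inclusive."""
    favorable = sum(comb(n_coins, k) for k in range(low, high + 1))
    return favorable / 2 ** n_coins


expected_heads = 1000 * 0.5                        # the average over many repetitions
p_exactly_500 = probability_of_heads(1000, 500)    # exactly 500 heads is uncommon
p_near_500 = probability_between(1000, 450, 550)   # a count near 500 is very likely
p_ten_tails = probability_of_heads(10, 0)          # ten coins, all tails

print(expected_heads, round(p_exactly_500, 4), round(p_near_500, 4))
print(p_ten_tails == 1 / 1024)
```

The same counting shows that when ten coins are tossed together, all ten fall tails in only 1 of the 1,024 equally likely patterns.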

In many scientific situations, predictions can be made only about aver- 
ages or total effects rather than about individual events. For example, the 
laws of gases are true only on the average. It can be predicted how a 
gaseous body will behave under changes in temperature, but it cannot be
predicted how the individual molecule will behave. As the temperature
is increased, the molecular movement increases, on the average. However,
because of collisions with others, some of the molecules might slow up
for a time, and some of the molecules might accelerate beyond the average.
But when the behavior of nearly countless molecules is averaged out,
predictions become very accurate for all practical purposes.


The social sciences are particularly dependent on concepts of probability.
For example, a principle about the behavior of “mobs” might hold
[...] the action of the group as a whole. It might be predicted that the
amount of shouting for the group as a whole would increase. However,
granting some credence for this hypothetical principle, it is quite likely
that some of the group members would actually become quieter as the
general noise level increased.

[...] average performance for a number of [...], it is equally important
to learn about the amount of error which such predictions entail.
Thinking back to the problems confronting the seventeenth-century
gamblers, the errors involved in the prediction of [...] serve to
illustrate the problem. If [...] tossed 10 coins, what are the odds that
eight of [...] would be heads? If the 10 coins were tossed 1,000 times,
how many times would all ten of them be tails? The expected occurrences
for [...] Table 2-1. We would expect to obtain all tails in 1 toss out of
1,024 tosses. As [...], the most frequently occurring pattern is 5 heads
and 5 tails.

The bell-shaped curve in Figure 2-4 is encountered very often in practice.
It is referred to as the normal distribution, or it is variously
called the normal curve and the normal curve of error. Much of the early
development of statistics was concerned with the properties of the normal
distribution.

Quetelet. Numerous efforts have been made to apply statistical methods
to the behavior of [...]. In the nineteenth century, Adolph Quetelet, a
Belgian statistician, gathered a considerable amount of information about
European populations. He collected information on the number of children
in families, the number of children born in different years, and the
physical characteristics of people. He found that many of these character- 
istics distribute themselves in a shape much like the normal distribution. 
For example, he found that the chest measurements of soldiers were dis- 
tributed like the normal distribution. 
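
The ten-coin example mentioned earlier shows how a bell-like shape arises from chance alone, and its figures can be checked by counting. The sketch below assumes fair coins and independent tosses; the names `N_COINS`, `TOTAL`, and `ways` are illustrative only.

```python
from math import comb

N_COINS = 10
TOTAL = 2 ** N_COINS  # 1,024 equally likely head/tail patterns

# Number of patterns that contain exactly `heads` heads, for 0 through 10.
ways = {heads: comb(N_COINS, heads) for heads in range(N_COINS + 1)}

for heads, count in ways.items():
    # A crude text histogram: one mark for every 6 patterns.
    print(f"{heads:2d} heads: {count:4d} in {TOTAL}  " + "*" * (count // 6))
```

Under these assumptions all tails (0 heads) occurs in 1 pattern out of 1,024, eight heads in 45 out of 1,024, and 5 heads and 5 tails is the most frequent result at 252 out of 1,024. The counts rise to a peak in the middle and fall away symmetrically, which is the rough outline of the normal curve.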

Because Quetelet knew that the normal distribution demonstrates the 
error in many types of predictions, he came to the conclusion that a nor- 
mal distribution of human traits shows "nature's error" in the composition 
of human beings. It was his idea that "nature" aims at producing the 
average man, and that the extremes in either direction are "accidents of 
nature." We remember Quetelet not so much for this principle as for the 
influence he had on men to follow. He was one of the first persons to 
make systematic studies of individual differences, and he interested others 
in the application of statistical methods to the study of human behavior. 


THE EFFECT OF THE THEORY OF EVOLUTION ON PSYCHOLOGY 


Although the theory of evolution had its most direct influence on the
science of biology, it also had an important impact on psychological
measurement and psychology as a whole. Before the nineteenth century, the
predominant view was that man is a static being, having since the day
of creation a uniform and unchanging set of physical and mental
attributes. Although Quetelet measured individual differences, he thought of
these as “nature’s mistakes” in producing the average man. It had always
been noted that men differ from one another, but there had been no
systematic attempt to study the part which individual differences play in
everyday life.

Darwin. There were dissenters to the static theory of man and his sur- 
roundings, and there were even some, men such as Lyell and Lamarck, 
who theorized that life changes from generation to generation. It
remained for Charles Darwin to amass the evidence to support the theory
of evolution. He collected many examples from the animal kingdom showing
the changing forms of life that develop in special environments. He
saw that the mechanism of selection, or “survival of the fittest” as he called
it, allows some animals to exist and forces others to die off. Therefore, the
animals whose particular features make them more suited to an environ- 
ment survive and pass on their genetic qualities to the next generation. 

Darwin’s major evidence for evolution was given in The Origin of 
Species published in 1859. It brought on several decades of argument 
among scientists as to the validity of the theory and the implications of 
the theory for scientific work. As the theory came to be accepted, it 
forced new viewpoints on scientific disciplines. It required biologists and
botanists to think of animals and plants as being in a state of change,
gradually progressing from one form to another to meet the changing
environments. The theory also gave an impetus to the study of individual
differences in psychology. It was reasoned that, if individual differences
in plants and animals make for differential ability to adapt and survive,
individual differences in humans would also have a functional
significance. Further, if plants and animals inherit ancestral characteristics,
some of the individual characteristics in humans might be accounted for
on the same basis.


EARLY STUDIES OF INDIVIDUAL DIFFERENCES 


Galton. The several lines of thought that we have traced (psychophysics,
statistics, and the theory of evolution) converged in the psychological
work of Sir Francis Galton. He was a member of a vanishing
“breed,” the gentleman scholar who, without formal university connections,
dabbled effectively in a wide range of subjects. His interests ranged
from criminology, psychology, biology, and anthropology, to the invention
of numerous scientific devices. He invented the "supersonic" whistle
and delighted himself by calling the dogs of London away from their
incredulous masters. He invented composite photography, a procedure
for combining a number of photographs into an average human appearance.
He applied the method to the photographs of criminals, trying to
obtain a composite of typical criminal features. He once practiced
paranoia, going about imagining that everyone was talking about him. After
several days he could imagine that even horses were plotting against him.
In order to understand primitive religions, he built a religion of his own.
He placed a comic picture of Punch on the wall and paid it all manner of
homage and tribute in order to grasp the meaning of the idol in primitive
worship. In addition to his many other activities, Galton also applied his
genius to the study of psychological traits.

Galton was Darwin's immediate follower in the field of psychology. He
developed an interest in the hereditability of individual characteristics
and [...] to support his views. In [...] Hereditary Genius, in which he
tried to show that “greatness” runs in families. He came to believe that
most human characteristics are inherited, not only the prominent
physical characteristics, but also abilities and personality characteristics.
He believed that criminality and psychological disorders are passed along
from father to son by direct inheritance.

Galton founded the study of eugenics, the [...] the betterment of the
human race [...] mating. Although Galton never [...] program would
involve, he was convinced that a world of superior humans could be
developed by encouraging gifted couples to marry and by discouraging
or preventing less gifted individuals from having children.

Galton realized that, before his program of eugenics could be under- 
taken, it would be necessary to learn more about the inheritance of indi- 
vidual characteristics. If it could be shown that a particular attribute, 
literary ability for example, is passed from father to son, then it should be 
possible to breed a generation of literary geniuses. However, if the attri- 
bute proved not to be inherited, there would be nothing that a program of 
eugenics could do to improve literary talent. Galton needed first to meas- 
ure human characteristics before proceeding to study their hereditability, 
and it is because of this need to measure individual differences that Gal- 
ton’s contributions are of such importance to us here. 

Galton coined the term “mental test” and set about to measure many
different human attributes. He recognized the need for standardization in
testing, that all subjects should be presented the same problems under
uniform conditions. Galton’s tests bore little resemblance to the ones
which are most widely used today. He and his immediate followers in
England made much of the philosopher Locke’s dictum that all knowledge
comes through the senses. They reasoned from this that the person with
the most acute “senses” would be the most gifted and most knowledgeable.
Most of Galton’s tests were measures of simple sensory discrimination:
the ability to recognize tones, the acuteness of vision, the ability
to differentiate colors, and many other sensory functions. Galton also
invented many instruments to use in his research, some of which are used
in modified form today.

Galton began the first large-scale testing program at his Anthropometric 
Laboratory in the South Kensington Museum in 1884. Each visitor was 
charged threepence for having his measurements taken on a variety of 
physical and sensory tests, including height, weight, breathing power, 
strength of pull, hearing, seeing, and the color sense. In order to analyze
the data obtained, Galton made use of statistical methods. With these,
he determined averages and measures of variation. His particular need
was for a measure of association, or correlation, to detect the amount of
resemblance between the individual characteristics of fathers and their
sons. He made the first attempts to derive the statistics of correlation, the
procedures which are now widely used in the study of individual
differences.


Pearson. Galton supported a younger colleague, Karl Pearson, in the
development of statistical methods for the study of individual differences.
Pearson was the genius in mathematical statistics that Galton was in
studies of individual differences. He derived the correlation coefficient,
partial correlation, multiple correlation, factor analysis, and laid the
foundation for most of the multivariate statistics which are in use in
psychology today (some of which will be seen in the chapters ahead).

Following the work of Galton, the last two decades of the nineteenth
century saw a considerable expansion of studies of individual differences,
in England and America. Prominent in this work was James McKeen
Cattell, who studied with Galton and then undertook investigations in
America. The tests continued to be largely of a sensory and motor type,
measuring sensory speed and reaction time. Toward the end of the century,
it came to be realized that these tests were not measuring intelligence.
Scores on the tests bore almost no relationship to measures of
intellectual achievement, such as school grades, refuting the hypothesis
that sensory ability is closely related to intelligence.


THE FIRST PRACTICAL MENTAL TESTS 


A brief summary will be given of the major developments in psychological
testing during this century. Each type of test which is mentioned
will be discussed in detail in later chapters.

At the turn of the twentieth century, some notable events in the history
of psychological measurement occurred in France. It was there that the
humanist tradition and the interest in medicine and social welfare were
centered. After the French Revolution, terms such as “equality” and “the
rights of man” showed the new interest in helping the downtrodden, the
sick, and also the psychologically maladjusted. Medicine developed rapidly
during the first part of the nineteenth century, partly because of the
Napoleonic Wars and the need to treat wounded soldiers. Pinel released
the insane from their dungeons and insisted that they were sick, not
possessed by demons. Charcot, Janet, and Ribot founded the field of
psychiatry and developed the first plausible theories of psychopathology. Freud
drew his early knowledge from these men and went on to be the founder
of psychoanalysis.

Binet. During the last quarter of the nineteenth century, the distinguished
line of French investigators was joined by Alfred Binet. His early
work was on the use of hypnotism in treating mental disorders; then near
the end of the century, he turned to the problem of measuring intelligence.
He was commissioned by the Minister of Public Instruction to construct
tests for the measurement of intelligence in school children. The
schools had become alarmed by the number of failures and needed some
means of distinguishing the lazy or maladjusted from the children who
lacked the fundamental [...].

Binet, in collaboration with Simon, completed his first test in
1905. The test was not principally concerned with sensory and motor
functions, as had been the case with the tests of Galton and his immediate
followers. Instead, it concerned the child’s ability to understand and
reason with the objects in his cultural environment. Consequently, the test
items were such as the naming of objects, the completion of sentences,
and the comprehension of questions.

The first Binet-Simon test had a range of questions from very simple 
ones, which even the dullest child might answer correctly, to questions 
which would indicate a superior level of ability. Practical work with the 
test indicated that some of the items were more difficult than had been 
anticipated, and some of the items proved to be insensitive to differences 
in ability. A revision of the test was made in 1908, in which the items were 
arranged in terms of age levels, with a group of items being designated as 
representing average intelligence for each particular age. The highest age 
level at which a child could perform adequately was called his mental
age. Later, Stern suggested that the mental age be divided by the
child's chronological age, which is essentially the IQ as it has come to be
known.
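
Stern's suggestion can be written as a simple ratio. The sketch below assumes the customary later convention, which is not stated in the passage above, of multiplying the quotient by 100 to remove the decimal; the ages in the worked example are hypothetical.

```latex
\mathrm{IQ} = \frac{\text{mental age}}{\text{chronological age}} \times 100,
\qquad \text{for example} \quad \frac{10}{8} \times 100 = 125
```

On this reading, a child whose mental age equals his chronological age obtains a value of 100.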

The Binet-Simon tests aroused considerable interest in other countries, 
in England and especially in America. Several translations and revisions of 
the tests were made in America, including forms by Goddard (4) and 
Kuhlmann (6). The tests became very popular in studying school chil- 
dren, particularly in the understanding of psychological maladjustment 
and juvenile delinquency. The most prominent revision of the Binet- 
Simon test was completed by Terman (9) in 1916, the resulting form 
being referred to as the Stanford-Binet test. For the first time a
considerable amount of research was done to select items which would be
appropriate for each age level and which would differentiate age levels. Also,
the large number of children who were tested provided norms for the
interpretation of test scores.

Terman and Merrill presented a revision of the Stanford-Binet in 1937 
(10). A truly heroic effort was made to construct a standardized measure 
of general ability. Several thousand children were tested in all parts of 
the United States to establish norms. This test has been used more fre- 
quently than any other comparable instrument and is still in wide use 
today. (The Stanford-Binet will be discussed in detail in Chapter 10.) 


THE FIRST GROUP TESTS OF ABILITY 


The Binet forms and their revisions are all individual tests, requiring 
an expert examiner to test one child at a time. During the First World 
War there was a need for classifying large numbers of recruits, to weed 
out men whose level of ability was so low that they would be a handicap 
to the service, and also to select more intelligent individuals for positions 
of responsibility. A special committee of the American Psychological
Association, with R. M. Yerkes as chairman, began an investigation of testing
for the armed services.

[...] had been working on a group test of “intelligence,” which
he turned over to the committee. With some revisions, it was put [...].
This test is referred to as the Army Alpha Test (7). It was a [...]
containing questions on arithmetic, reasoning, and [...].
The test proved very successful, came into wide use in the [...],
and is still used to some extent today. A companion test, the Army Beta
(8), was constructed for use with individuals whose knowledge of
English is scanty. The test instructions are given in pantomime, and the
items do not require the subject to use or understand English. The test
requires the subject to solve geometrical puzzles and to analyze [...].
Although the Beta test was not so successful as the Alpha test, it set the
stage for the nonlanguage tests which are in use today.


STANDARDIZED CLASSROOM EXAMINATIONS 


The Essay Exam. The standardized classroom examination as we know
it today came into wide use only during the last fifty years. Before that
time it was customary for the instructor to give grades on the basis of his
accumulated impressions of the student or on the basis of oral
examinations. Grading in this way can be unfair to the student who is not popular
with the teacher or who does not have the verbal facility to impress
others. The first standardized examinations were of the “essay” type, in
which the student writes about a number of problems or topics. Although
grading is still somewhat subjective, the "essay" examination is
standardized to the extent that all students receive the same questions, providing
a uniform basis of comparison.

The Multiple Choice Examination. Progress in classroom examinations
has moved in two directions, in the refinement of essay examinations and
in the development of multiple choice tests. The advent of the multiple
choice, or "objective" test, as it is sometimes called, ruled out the
unreliability of instructors’ impressions; and although there are abuses in the
use of multiple choice tests, examinations of this kind are at least
uniformly fair to all students.

Standardized Achievement Tests. A more recent trend in classroom testing
is the use of achievement tests which are constructed by a group of
specialists, instead of the teacher, and are applied to all of the children in
a school system. This allows a comparison of the children in different
classes of the same school and comparison among the children in different
schools. Now the teacher can use standardized achievement tests for
many different subject matters, ranging from reading ability to knowledge
of geography.


THE DEVELOPMENT OF PERSONALITY TESTS


In line with the sequence in which tests were developed, we have 
talked so far about tests of ability, which concern how much an individual 
knows, how rapidly he can work, or how well he can solve problems. 
Tests of this kind are sometimes referred to as "cognitive" tests and have 
been contrasted with the variety of attributes which are “noncognitive” or 
not abilities in the strict sense. Cronbach (3) has distinguished these two 
kinds of attributes as being tests of “maximum performance" as opposed 
to tests of “habitual performance.” You can see how these functions would 
differ if we were considering a test for “friendliness.” It would not be a 
case of how friendly an individual could be if he tried, because most of 
us know how to be friendly. The question would be that of how friendly 
an individual habitually is in his dealings with others. Another synonym 
for noncognitive tests or tests of habitual performance is personality tests. 
Although the term has come to be used far too broadly to cover interests, 
attitudes, mental health, and many other matters, it is too firmly in use 
to be avoided. 

Tests of ability were developed first and more firmly not because of a 
lack of interest in personality but because of the difficulty in measuring 
the noncognitive functions. Galton was as interested in the inheritance 
of personality as he was in the inheritance of abilities. However, the 
nearest he came to measuring personality characteristics was in a ques- 
tionnaire study of “imagery,” concerning the extent to which individuals 
can picture objects in memory. Binet was interested in tests of personality 
before constructing the first “intelligence” test, and among other devices, 
he designed a questionnaire test of “morality” for children. 

Self-report Inventories. Psychologists were not only busy during the
First World War developing “intelligence” tests, but they also tried to
meet the need for tests of neuroticism and emotional instability. The
number of recruits was too large to be seen individually in psychiatric
interviews. R. S. Woodworth constructed a questionnaire which asked each
recruit essentially the kinds of questions that would be used in an
interview. The questions concerned the individual's adjustment in the home,
school, and among friends. It was used primarily to sift out those men who
needed further clinical study, but not, as similar inventories were often
used afterward, to provide "measures" of personality. All the difficulties
involved in the use of this instrument still continue to hinder the
development of personality tests: the individual's inability or unwillingness to
describe himself accurately and the difficulties in finding a valid set of
items for use with even the most cooperative subjects.

Projective Tests. During the First World War, a Swiss psychiatrist
named Hermann Rorschach was working on a novel line [...] measurement
of personality. The Rorschach Test (see 1), as it has come to be
known, consists of 10 ink blots which the subject is asked to [...]
and interpret. The test grew out of Freudian and other [...] psychologies
with their emphasis on unconscious motivation and the [...]
of symbolism. The purpose of the test is to get beyond what the subject
knows about himself, and is willing to relate, to the deeper traits and
urges which determine his overt behavior. Since Rorschach's
research, numerous other projective techniques have been developed. In
spite of the great difficulties in standardizing tests of this kind, the
Rorschach and its followers have become the mainstay of the clinical
psychologist.


THE PERIOD BETWEEN THE TWO WORLD WARS 


As wars have so often been the gloomy milestones in historical
developments, the direction of the measurement movement in psychology
changed with the two major conflicts of this century. We saw how the
practical problems of the First World War encouraged the development
of the first group tests of “intelligence” and personality. The products
were inexpensive, convenient tests which could be applied to large
numbers of people. Consequently, group tests came into vogue during
the 1920s and 1930s and, unfortunately, were often applied uncritically
to many problems. There was too little concern for testing the test, too
little concern for determining the reliability and validity of the measures.
Teachers often took the results of “intelligence” tests at face value and
made major decisions about students on the basis of questionable
measures. The general public became oversold on tests, and the IQ came to be
regarded as an infallible and fixed mark on the individual. Numerous tests
of personality purporting to measure “introversion,” “marriage
adjustment,” “happiness,” and many other complex attributes were constructed
and sold with little research foundation. The uncritical overacceptance
of tests during that period is not unlike a similar overenthusiasm that has
occurred in the development of many other products: we have seen
numerous medical discoveries regarded at first as cure-alls then later put
to a more limited and more proper use.


FROM THE SECOND WORLD WAR TO THE PRESENT 


A great conflict again presented needs for methods of classifying men.
However, this time the need was on a much larger scale. Over ten million
men and women were called into service, and they had to be allotted to
a wide range of positions: jobs in airplane navigation, electronics, and
meteorology, which had not existed a decade earlier.
Instead of having to construct only several group tests as had been the
case in the previous conflict, psychologists were called on to construct
numerous batteries of tests, some of them complicated beyond the
extremes of what the early testers might have imagined. Psychologists were
brought into the problem wholesale and spent several years in implement- 
ing the largest testing program that had ever been undertaken. In the 
process, the logic and methods of psychological measurement were con- 
siderably extended, and many researches were undertaken which would 
not have been possible without government support. The psychology of 
testing was pushed years ahead by the work which was undertaken in 
conjunction with the armed forces. Much of the work was methodological, 
concerning the best way to measure particular functions, in contrast to 
the pell-mell effort of earlier decades to compose tests without proper 
research foundation. 

Now we find that the people who construct tests and the people who 
use them are more aware of the limitations of particular measures, more 
inquisitive about the research findings on an instrument, and more mind- 
ful of alternative approaches. A healthy skepticism has grown, and with 
it have come promising new lines of investigation in the measurement of 
individual differences. Testing has become a big business, with numerous 
firms devoting themselves to the development and sale of tests. Hundreds 
of psychologists and professional educators spend most of their time as 
measurement specialists. Psychological publications have hundreds of 
articles each year on specialized problems in measurement. However, 
there are still abuses in testing—tests being sold without proper research 
and a naive acceptance by some persons of any scrap of paper which 
purports to be a test. In spite of the remaining abuses and in spite of the 
unsolved problems in many kinds of tests, psychological tests can now 
point to an increased efficiency and marked economy which they have 
brought to schools, the Armed Forces, psychological clinics, and more
recently, to industry.


CHAPTER 3


The Use of Scores and Norms 


Test results are usually reported in some numerical form, for example, 
the total number of questions answered correctly on a true-false test. The 
numbers obtained in this way, before they are converted to any other 
form, are referred to as raw scores. Raw scores are seldom directly mean- 
ingful without some qualification as to how well other persons do or as to 
the established standards of performance. If Johnny tells his mother that 
he has made 22 in arithmetic and 48 in spelling, the mother might say, 
"That's good work in spelling, but we must do something about your arith- 
metic.” But the 22 in arithmetic might have been the highest score; and if 
the spelling test consisted of 100 words, and the teacher expected the
students to know the majority of them, a score of 48 would be an indication
of poor performance.


MEASURES OF CENTRAL TENDENCY


The first thing that we need to learn about Johnny’s performance is how 
well the students as a group performed on the two tests. This will tell us 
whether Johnny is above or below average. To simplify the problem, let 
us assume that there are only 11 students in the class and that they made 
scores on the two tests as follows: 


            Arithmetic   Spelling
Johnny          22          48
Fred            12          52
Mary            14          49
Bill            12          51
Jane            14          55
Susan           14          52
Michael         17          50
Sharon          19          62
Harry           11          56
Patricia        15          52
Eric            20          75


By looking at the list of scores, it can be seen that Johnny did very well
in arithmetic in comparison to the other students, but relatively poorly in
spelling. However, Johnny’s mother would probably not have an
opportunity to look at the list of scores; and it would be a clumsy way to make
comparisons if it were necessary to pass around the list to everyone who
was interested in the test results.

Mode. Some index is needed to let the mother know at once whether 
Johnny did as well as the other children. One such index, called the mode, 
is obtained quite simply by finding the score made most frequently by 
the students. In the arithmetic test, more students made a score of 14 
(Mary, Jane, and Susan) than any other score. The mode is 52 on the 
spelling test. Although the mode is not as useful as some other measures, 
it is sometimes used in testing as a measure of central tendency. 
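
Finding the mode amounts to counting how often each score occurs. The short sketch below uses the class scores listed above; the function name `mode` and the tie-handling rule (the first score met wins) are illustrative assumptions, since the text does not discuss ties.

```python
from collections import Counter

# Scores in the order of the class list (Johnny, Fred, Mary, ..., Eric).
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]


def mode(scores):
    """Return the most frequent score (the first one met, if several tie)."""
    return Counter(scores).most_common(1)[0][0]


print(mode(arithmetic), mode(spelling))  # 14 52
```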

Median. Another measure of central tendency is obtained by seeking
the score which has an equal number of students above and below that
point. In the arithmetic test, the median is 14. Of the 11 students, five
made scores higher than 14 and six made scores of 14 or less. The median
is 52 on the spelling test, and there are three students whose scores fall
exactly at the median. If there is an even number of scores, as there would
be if we showed the scores for 10 students instead of 11, the median falls
between two scores. If Fred had not taken the arithmetic test, the median
would have fallen between 14 and 15. The usual practice is to split the
difference and say that the median is 14.5. If Johnny’s score of 48 were
not considered on the spelling test, the median would again lie between
two scores, both of which are 52. The median would then still be 52.
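
The rule for odd and even numbers of scores can be sketched in a few lines. The helper names `median` and `without` below are illustrative; the scores are those of the class list above.

```python
arithmetic = {"Johnny": 22, "Fred": 12, "Mary": 14, "Bill": 12, "Jane": 14,
              "Susan": 14, "Michael": 17, "Sharon": 19, "Harry": 11,
              "Patricia": 15, "Eric": 20}
spelling = {"Johnny": 48, "Fred": 52, "Mary": 49, "Bill": 51, "Jane": 55,
            "Susan": 52, "Michael": 50, "Sharon": 62, "Harry": 56,
            "Patricia": 52, "Eric": 75}


def median(scores):
    """Middle score when N is odd; midpoint of the two middle scores when even."""
    ordered = sorted(scores)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # "split the difference"


def without(scores_by_name, name):
    """Scores of everyone except the named student."""
    return [score for student, score in scores_by_name.items() if student != name]


print(median(arithmetic.values()))          # 14
print(median(without(arithmetic, "Fred")))  # 14.5
print(median(without(spelling, "Johnny")))  # 52.0
```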

The median can be thought of as the score made by the "average
person.” This is different, as will be shown shortly, from the average score.
One advantage of the median is that it is easy to understand. The teacher
would find it relatively easy to explain this index to Johnny’s mother, and
it would prove useful in discussing Johnny’s class standing.

The reason that the mode and the median are not used more is that they
cannot serve as a basis for the derivation of other statistics which are
needed. If we start off with certain statistical measures instead of others,
it greatly simplifies the mathematical work to follow; and the mode and
median are poor starting points.

Mean. A measure of central tendency can be obtained which is both
easy to understand and can be used in numerous other mathematical
developments. The mean is obtained by adding all of the scores on a test
and dividing the sum by the number of persons tested. Introducing some
of the symbolism which will prove useful, let X stand for the raw scores
made on a test, N for the number of subjects, and M for the mean. Then
the formula for the mean is

$$M = \frac{\sum X}{N}$$

The summation sign (Σ) stands for the process of adding scores, the
scores of Johnny, Fred, Mary, and so on, for the 11 students.

The symbolism can be made more specific by the use of subscripts to
show the test on which the mean is being obtained. Letting the arithmetic
test be referred to as test 1, the formula for the arithmetic mean can be
more completely specified as

$$M_1 = \frac{\sum X_1}{N_1}$$

Different subscripts can be used to designate other tests, such as using the 
subscript 2 in referring to statistical operations on the spelling-test scores, 
and the subscripts 3, 4, 5, and so on, for other tests. 

Applying the formula to the students’ grades, a mean of 15.45 is found 
for the arithmetic test and a mean of 54.73 for the spelling test. A com- 
parison can be made of the three measures of central tendency on the 
two tests as follows: 


Arithmetic Spelling 
Mode 14 52 
Median 14 52 
Mean 15.45 54.73 


The three measures give similar results on both tests. Note that the mean
on the spelling test differs by 2.73 from the median and the mode. Looking
back at the scores, it can be seen that seven of the eleven people made
less than the mean on spelling. This is due to Eric's performance in
spelling, a score so divergent from the others as to affect the mean
unduly. It is in situations like this that the mean can cause confusion in
the interpretation of test scores.

When either one person or a small group of persons makes scores which 
are markedly higher or lower than the rest of the group, the median is 
often more useful than the mean for interpreting test results. Fortunately, 
such extreme cases as the example above are not often found in practice. 
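
The comparison of the three measures can be checked with a short sketch. The two score lists below are an assumption: they are read off the frequency distributions given later in the chapter, not taken from the original class roster, and the function name is illustrative only.

```python
from statistics import mean, median, mode

# Scores as read from the chapter's frequency distributions (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]
spelling = [75, 62, 56, 55, 52, 52, 52, 51, 50, 49, 48]


def central_tendency(scores):
    """Return the mode, median, and mean (rounded to two places)."""
    return {
        "mode": mode(scores),
        "median": median(scores),
        "mean": round(mean(scores), 2),
    }


arithmetic_summary = central_tendency(arithmetic)
spelling_summary = central_tendency(spelling)

# Count how many spelling scores fall below the spelling mean.
below_spelling_mean = sum(1 for score in spelling if score < mean(spelling))

print(arithmetic_summary)
print(spelling_summary)
print(below_spelling_mean)
```

With one very high spelling score in the list, the mean is pulled above the median and mode, which is the effect the text attributes to Eric's score.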

Deviation Scores. As a first step in transforming raw scores to a more 
useful form, each score can be expressed in terms of its distance from the 
mean. The transformed scores are referred to as deviation scores and are 
symbolized by a lower-case x: 


$$x = X - M$$

The formula states that each person's deviation score is obtained by
subtracting the mean from his raw score. The class grades can be
transformed to deviation scores as follows:

        Arithmetic     Spelling
            x1            x2
           6.55         -6.73
          -3.45         -2.73
                        -5.73
                        -3.73
                         0.27
                        -2.73
                        -4.73
                         7.27
                         1.27
                        -2.73
                        20.27



Deviation scores tell whether an individual is above or below average. 
Because Johnny’s deviation score in arithmetic is positive, we know that 
he is above the mean; because his deviation score is negative on the spell- 
ing test, we know that he is below the mean. 
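
A minimal sketch of the conversion to deviation scores follows. The score lists are an assumption (read off the chapter's frequency distributions and sorted from high to low, so positions do not pair students across the two tests); the function name is illustrative.

```python
# Scores as read from the chapter's frequency distributions (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]
spelling = [75, 62, 56, 55, 52, 52, 52, 51, 50, 49, 48]


def deviation_scores(scores):
    """Subtract the mean from every raw score: x = X - M."""
    m = sum(scores) / len(scores)
    return [x - m for x in scores]


x1 = deviation_scores(arithmetic)
x2 = deviation_scores(spelling)

# Johnny's raw scores are 22 in arithmetic and 48 in spelling.
johnny_arithmetic = x1[arithmetic.index(22)]
johnny_spelling = x2[spelling.index(48)]

print(round(johnny_arithmetic, 2), round(johnny_spelling, 2))
```

The sign alone tells whether a score is above or below the mean, and the deviation scores on each test sum to zero apart from floating-point rounding.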


MEASURES OF DISPERSION 


Before we can interpret a particular deviation score, we must learn
how widely the scores are scattered above and below the mean. A deviation
score of 2.00 would represent superior performance if all of the
scores are closely packed about the mean. But if there are deviation
scores that go as high as +100 and as low as -100, a deviation score of
2.00 would indicate near average performance. Consequently, we need an
index of the spread, or scatter, of scores about the mean in order to
interpret particular deviations.

The Range. There are various indices of how widely a group is scattered,
or of the dispersion, as it will be called. One very simple index, the
range, is obtained by subtracting the lowest score from the highest score.
The highest score on the arithmetic test is 22 and the lowest is 11. This
gives a range of 11. The range on the spelling test is 27, showing that the
dispersion of scores is greater than that in arithmetic. The range is a
quickly obtained and often used index of dispersion. However, it lacks
some of the properties that are needed for an acceptable measure. It is
dependent on only two scores, the highest and the lowest. If Eric had not
taken the spelling test, the range would be only 14 instead of 27. Also, the
range lacks the mathematical properties which can lead to the development
of other statistics (a point to which we will appeal quite often in
choosing statistical measures).

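
The dependence of the range on just two scores can be illustrated with a short sketch. The score lists are an assumption (read off the chapter's frequency distributions), and the helper name is illustrative.

```python
# Scores as read from the chapter's frequency distributions (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]
spelling = [75, 62, 56, 55, 52, 52, 52, 51, 50, 49, 48]


def score_range(scores):
    """Highest score minus lowest score."""
    return max(scores) - min(scores)


arithmetic_range = score_range(arithmetic)
spelling_range = score_range(spelling)

# Drop the single highest spelling score (Eric's) and recompute.
spelling_without_top = sorted(spelling)[:-1]
spelling_range_without_top = score_range(spelling_without_top)

print(arithmetic_range, spelling_range, spelling_range_without_top)
```

Removing one person nearly halves the spelling range, which is why the range is a poor basis for further statistics.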
The Average Deviation. An index of dispersion which is dependent on
all of the scores instead of just two of them and which indicates the
position of an individual in a group is the average deviation. As the name
implies, it is obtained by finding how much the scores deviate on the
average from the mean, as follows:

$$AD = \frac{\sum |x|}{N}$$

        Arithmetic     Spelling
           |x1|          |x2|
           6.55          6.73
           3.45          2.73
                         5.73
                         3.73
                         0.27
                         2.73
                         4.73
                         7.27
                         1.27
                         2.73
                        20.27

        AD = 2.94      AD = 5.29

The symbol |x| indicates that we are dealing with absolute deviations,
paying no attention to the signs.

One method of transforming deviation scores to a more useful form is
to divide each of them by the average deviation (AD). This conversion
will give Johnny a score of 2.23 in arithmetic and -1.27 in spelling. When
the distribution of scores is normal, scores transformed in this manner will
lead to a precise indication of where an individual stands in a group.
However, developing this measure further will not be worth the effort.
There are more desirable measures of dispersion which can be used; the
range and AD have been discussed to provide a background for the
measure now to be developed.
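
The average deviation and the scores divided by it can be sketched as follows. The score lists are an assumption (read off the chapter's frequency distributions), and the function name is illustrative. Working from unrounded values, Johnny's arithmetic figure comes out near 2.22 rather than the 2.23 obtained by hand from the rounded 6.55 and 2.94.

```python
# Scores as read from the chapter's frequency distributions (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]
spelling = [75, 62, 56, 55, 52, 52, 52, 51, 50, 49, 48]


def average_deviation(scores):
    """Mean of the absolute deviations from the mean: AD = sum(|x|) / N."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)


ad_arithmetic = average_deviation(arithmetic)
ad_spelling = average_deviation(spelling)

# Johnny's deviation scores divided by the AD of each test.
mean_arithmetic = sum(arithmetic) / len(arithmetic)
mean_spelling = sum(spelling) / len(spelling)
johnny_arithmetic = (22 - mean_arithmetic) / ad_arithmetic
johnny_spelling = (48 - mean_spelling) / ad_spelling

print(round(ad_arithmetic, 2), round(ad_spelling, 2))
print(johnny_arithmetic, johnny_spelling)
```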

The Standard Deviation. The AD has a serious fault: it is based on
absolute scores. It is very difficult to work mathematically with absolute
scores; and consequently, if the AD is used in some of the early statistical
work, it severely limits the development of other measures. However, we
will also find ourselves blocked if we seek a measure of dispersion by
working with the deviation scores. The sum of these is always zero in any
set of scores. Therefore, equations based on the sum of deviation scores
"fall apart" and leave nothing with which we can work.

An alternative to using either x scores or |x| scores is that of working
with the squared deviations. These will all be positive, and it also
happens that they provide an excellent starting point for the derivation of
many other statistics. The squared deviations on the two tests are
obtained by squaring each of the deviation scores given earlier.

The mean square deviation can be obtained by summing the squared
deviations and dividing by the number of persons who took the test. This
statistic is called the variance and is symbolized as $\sigma^2$:

$$\sigma^2 = \frac{\sum x^2}{N}$$

An even more useful statistic is obtained by taking the square root of the
variance, which is then called the standard deviation:

$$\sigma = \sqrt{\frac{\sum x^2}{N}}$$

Applying the formula to the arithmetic and spelling tests, the following
variances and standard deviations are found:

        Arithmetic                          Spelling
        σ² = 128.70 / 11 = 11.70            σ² = 602.12 / 11 = 54.74
        σ  = 3.42                           σ  = 7.40


Subscripts can be used with the variance and standard deviation formulas
to indicate which test is being studied, like using $\sigma_1$ to refer to
the standard deviation of the scores on a particular test.

The standard deviation and variance can be obtained without actually
going through the step of converting from raw to deviation scores, as
follows:

$$\sigma^2 = \frac{\sum (X - M)^2}{N} = \frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2$$

Identical results will be obtained by either approach, showing here at an
early stage the mathematical maneuverability that is gained by working
with squared deviations.
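
The agreement between the deviation-score formula and the raw-score formula can be checked numerically. This is only a sketch: the score lists are an assumption (read off the chapter's frequency distributions), and the function names are illustrative.

```python
import math

# Scores as read from the chapter's frequency distributions (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]
spelling = [75, 62, 56, 55, 52, 52, 52, 51, 50, 49, 48]


def variance_from_deviations(scores):
    """Variance as the mean of the squared deviation scores."""
    n = len(scores)
    m = sum(scores) / n
    return sum((x - m) ** 2 for x in scores) / n


def variance_from_raw(scores):
    """Variance from raw scores: mean of squares minus square of the mean."""
    n = len(scores)
    return sum(x * x for x in scores) / n - (sum(scores) / n) ** 2


var_arithmetic = variance_from_deviations(arithmetic)
var_spelling = variance_from_deviations(spelling)
sd_arithmetic = math.sqrt(var_arithmetic)
sd_spelling = math.sqrt(var_spelling)

print(round(var_arithmetic, 2), round(sd_arithmetic, 2))
print(round(var_spelling, 2), round(sd_spelling, 2))
```

Both functions divide by N, the number of persons tested, as the text does; no correction for sample size is applied.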


THE EFFECT OF LINEAR TRANSFORMATIONS ON THE MEAN 
AND STANDARD DEVIATION 


If an arbitrary number were added to each of the raw scores in a
distribution, how would this change the mean and standard deviation? For
example, we could add the number 5 to each of the arithmetic scores
mentioned earlier. The new mean would then be 5 more than the original
mean. The proof of this is simple:

$$M_X = \frac{\sum X}{N}$$

$$M_{(X+C)} = \frac{\sum (X + C)}{N} = \frac{\sum X + NC}{N} = \frac{\sum X}{N} + C = M_X + C$$

where C stands for any constant, such as 5
      $M_{(X+C)}$ stands for the mean of the scores after the constant is
      added to each score in turn

If then the mean of the original scores is 10, and we add 5 to each of
the scores, the new mean is 15. (The student should work out the simple
proofs on his own. Not only will they prove relatively painless, but by
thoroughly understanding the steps in the simple proofs, it will be easier
to follow more complex statistical developments.)

Next we can determine what happens to the standard deviation when
a constant is added to each of the scores. We could work this from the
raw score formula; but since we know that the standard deviation is the
same whether it is determined from raw or deviation scores, it will be
simpler to work with the deviation score formula.

$$\sigma_{(x+C)} = \sqrt{\frac{\sum [(x + C) - M_{(x+C)}]^2}{N}}$$

What the formula shows is that from each new score, which will be the
original deviation score plus the quantity C, is subtracted the mean of
the new scores. The mean of the original deviation scores is zero by
definition. Then the mean of the new scores is C, so we can rewrite the
formula as

$$\sigma_{(x+C)} = \sqrt{\frac{\sum [(x + C) - C]^2}{N}}$$

which reduces to

$$\sigma_{(x+C)} = \sqrt{\frac{\sum x^2}{N}}$$


This is the same formula used to obtain the standard deviation of the 
original scores. We have then proved that the standard deviation is 
unchanged if we add or subtract a constant from each of the scores. If 
then the standard deviation of a set of scores is 2, and we add 5 to each 
of them, the new standard deviation is also 2. 

Next we can see what will happen if each of the scores is multiplied by
a constant. The mean would then become

$$M_{(CX)} = \frac{\sum CX}{N}$$

Because a constant term can always be moved to the left of a summation
sign,

$$M_{(CX)} = \frac{C \sum X}{N} = C M_X$$

We have proved that if each score is multiplied by a constant, the mean
of the new scores equals the mean of the old scores times the constant. If
then the mean of the original scores is 10, and we multiply each score
by 3, the new mean is 30.

The effect on the standard deviation of multiplying each score by a
constant is as follows:

$$\sigma_{(Cx)} = \sqrt{\frac{\sum (Cx)^2}{N}} = \sqrt{\frac{C^2 \sum x^2}{N}} = C \sqrt{\frac{\sum x^2}{N}} = C \sigma_x$$

The standard deviation of the new scores will be C times as large as the
old standard deviation. If then the standard deviation of the original
scores is 2, and we multiply each score by 3, the new standard deviation
is 6. The student should satisfy himself that the variance will increase by
the square of the constant.

We have taken the time to derive these results for several reasons. The 
mean and standard deviation will be seen many times in the discussion of 
test results, and it is important to understand their properties. Knowing 
how these statistics are affected by linear transformations will allow us 
to make many computational short cuts without having to re-compute 
mean and standard deviation. Also, the simple proofs which are involved 
here demonstrate some of the manipulations which, when applied to 
more complex equations, permit us to derive many useful results. 
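
The four results on linear transformations can be confirmed numerically with a small sketch. The arithmetic scores are an assumption (read off the chapter's frequency distribution); `pstdev` and `pvariance` from the standard library divide by N, matching the formulas used here.

```python
from statistics import mean, pstdev, pvariance

# Scores as read from the chapter's frequency distribution (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]

C_ADD = 5       # constant added to every score
C_MULTIPLY = 3  # constant multiplying every score

shifted = [x + C_ADD for x in arithmetic]
scaled = [x * C_MULTIPLY for x in arithmetic]

# Adding a constant moves the mean by that constant, leaves the SD alone.
mean_shift = mean(shifted) - mean(arithmetic)
sd_change_after_shift = pstdev(shifted) - pstdev(arithmetic)

# Multiplying by a constant multiplies both mean and SD by it,
# and multiplies the variance by the square of the constant.
mean_ratio = mean(scaled) / mean(arithmetic)
sd_ratio = pstdev(scaled) / pstdev(arithmetic)
variance_ratio = pvariance(scaled) / pvariance(arithmetic)

print(mean_shift, sd_change_after_shift)
print(mean_ratio, sd_ratio, variance_ratio)
```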


SCORE DISTRIBUTIONS 


The Normal Distribution. The standard deviation is most easily inter- 
preted when working with a normal distribution of scores. The normal 
distribution, or normal curve, as it is sometimes called, was mentioned in 
Chapter 2, but no fuller explanation was given at that time. The normal 
distribution refers to the way in which the score frequencies fall. Instead 
of listing test scores, as we did with the arithmetic and spelling tests, the 
same information could have been given by specifying the number of
persons who obtained each score. Using the symbol f to stand for
frequency, frequency distributions for the two tests are as follows:

        Arithmetic          Spelling
        X1      f           X2      f
        22      1           75      1
        21      0            .
        20      1            .
        19      1           62      1
        18      0            .
        17      1            .
        16      0           56      1
        15      1           55      1
        14      3           54      0
        13      0           53      0
        12      2           52      3
        11      1           51      1
                            50      1
                            49      1
                            48      1

The frequency distributions are presented graphically in Figure 3-1.
There are not enough scores to indicate much about the distribution form
for either test. However, when there are many scores instead of just
eleven, the shape of the distribution begins to become apparent. If we
had given the arithmetic test to 200 students, the distribution of scores
in arithmetic might have looked like that in Figure 3-2. In Figure 3-2 it
can be seen that most of the scores are clustered around 16, and that
scores become fewer and fewer as we go either toward high scores or
toward low scores.

Figure 3-1. Frequency distributions for arithmetic and spelling tests.

Figure 3-2. Hypothetical frequency distribution of arithmetic scores for
200 students.

As the scores in Figure 3-2 are drawn, they resemble the normal
distribution. The standard deviation tells the number of persons whose
scores fall in different regions of the normal curve. Figure 3-3 shows
that approximately 68 per cent of the persons make scores between plus
one and minus one standard deviations and 95 per cent of the scores
fall between plus two and minus two standard deviations. A detailed
breakdown of the percentages of scores in various regions of the normal
distribution is presented in Appendix 3. It will prove useful in interpreting
the many statistics based on the normal distribution.

Figure 3-3. Per cents* of subjects in various regions of the normal
distribution.

* The per cents add up to 99.6 instead of 100 because a fraction of 1 per
cent of the cases lies above and below three standard deviations.
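
The percentages quoted for the normal curve can be reproduced from the error function in the standard library. This is a sketch of the standard relation between the normal distribution and `math.erf`, not a procedure taken from the text; the function name is illustrative.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean, on both sides."""
    return math.erf(k / math.sqrt(2))


within_one = fraction_within(1)
within_two = fraction_within(2)
within_three = fraction_within(3)

for k, fraction in [(1, within_one), (2, within_two), (3, within_three)]:
    print(k, round(100 * fraction, 1))
```

The small remainder beyond three standard deviations is why percentages listed only out to plus and minus three do not quite total 100.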

Standard Scores. A very useful type of score can be obtained by dividing
an individual's deviation score on a test by the standard deviation of
the scores:

$$z = \frac{X - M}{\sigma}$$

where z symbolizes the standard score.

Applying the formula to Johnny's arithmetic score we find

$$z = \frac{22 - 15.45}{3.42} = \frac{6.55}{3.42} = 1.92$$
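
A brief sketch of the conversion to standard scores follows. The arithmetic scores are an assumption (read off the chapter's frequency distribution), and the function name is illustrative. Working by hand from the rounded figures 6.55 and 3.42 gives roughly 1.92 for Johnny; the unrounded computation below gives roughly 1.91.

```python
from statistics import mean, pstdev

# Scores as read from the chapter's frequency distribution (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]


def standard_scores(scores):
    """Convert raw scores to z scores: z = (X - M) / sigma."""
    m = mean(scores)
    sd = pstdev(scores)  # divides by N, as in the text
    return [(x - m) / sd for x in scores]


z1 = standard_scores(arithmetic)
johnny_z = z1[arithmetic.index(22)]

print(johnny_z)
print(mean(z1), pstdev(z1))
```

Whatever the original scores, the resulting z scores have a mean of zero and a standard deviation of one, apart from floating-point rounding.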

An individual's standard score indicates his place in the frequency
distribution (if the distribution of scores is approximately normal). If a
person has a standard score of 2.1, this indicates that his score is above
two standard deviations, and that about 98 per cent of the subjects made
lower scores. Similarly, a person who makes a standard score of less
than -1.0 is lower than 84 per cent of the subjects.

When the raw scores on different tests are converted to standard
scores, they have the same mean, which is zero in all cases. What is the
standard deviation of a set of standard scores? Because of the way in
which standard scores are obtained, the standard deviation of any set of
standard scores is 1.00.

Converting raw scores to standard scores makes results comparable
from test to test. If an individual makes a standard score which is
positive on two tests, then we know that he did better than average on
both tests. If he makes a standard score of 1.00 on one test and 1.50 on
another, we know that he performed better on the second test.

Transformed Standard Scores. For practical purposes, it is often useful
to express test scores by a modification of z scores. A teacher might have
difficulty in explaining to a student's mother that a standard score of
zero meant average performance rather than zero performance. Also, the
negative values which are obtained with standard scores are difficult for
some persons to understand. Starting with a set of standard scores, and
knowing what we do about transformations, we can derive a new set of
scores with any mean and standard deviation that we like. If we desire
to have the scores in such a form that the mean is 100 and the standard
deviation is 10, the first step is to multiply all the standard scores by 10.
Then 100 is added to each of the resulting scores. If it is desired to
compare the scores on two tests which have different means and standard
deviations, the mean and standard deviation of one test can be made
the same as the other. It proves much easier to make transformations
directly on raw scores rather than having to convert to standard scores
first. A formula for this is given in the Appendix.

Rank-order Scores. In Chapter 1, it was said that psychological
measures are sometimes expressed in the form of ranks. Ranks are often used
to describe test scores when the number of subjects is not large. If a
mother is told that her daughter made the fifth highest score in a class of
eleven, she can easily understand how well the daughter performed in
comparison to the other children.

The scores on the arithmetic and spelling tests can be expressed as
ranks. Because Johnny made the highest score on the arithmetic test, his
score would become rank 1. Eric would have rank 2, Sharon would have
rank 3, and so on. When several persons make the same score, the
individual is given the average of the ranks for the group. There are three
persons who have scores of 14 in arithmetic. They must share the ranks
of 6, 7, and 8. The mean of these three ranks is 7, and consequently all
three persons would be given a rank of 7. There are two people with
scores of 12, and they must share the ranks of 9 and 10. They are each
given a rank of 9.5. The lowest score, that made by Harry, is then given
a rank of 11.
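
The handling of tied ranks can be sketched in a few lines. The arithmetic scores are an assumption (read off the chapter's frequency distribution), and the function name is illustrative.

```python
# Scores as read from the chapter's frequency distribution (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]


def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; tied scores share
    the mean of the rank positions they occupy."""
    ordered = sorted(scores, reverse=True)
    positions = {}
    for position, score in enumerate(ordered, start=1):
        positions.setdefault(score, []).append(position)
    return [sum(positions[s]) / len(positions[s]) for s in scores]


ranks = average_ranks(arithmetic)

for score, rank in zip(arithmetic, ranks):
    print(score, rank)
```

The three scores of 14 occupy positions 6, 7, and 8 and so each receive rank 7; the two scores of 12 each receive 9.5.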

Percentile Ranks. When the number of scores is large, as when we are
treating 200 scores instead of 11, it proves useful to make a conversion of
ranks. One useful conversion is to express each individual's score in
terms of the per cent of scores which are lower. If an individual ranks
fourth among 200 persons, then 196 persons rank lower. Dividing 196 by
200, we find that the individual has a percentile score of 98. Because
several or more persons will usually make the same score, a modification of
the basic formula is often used to take account of tied ranks. The formula
is to divide the number of people below a particular score, plus half the
number of people who make the score, by the total number of subjects.
Scores obtained in this way are referred to as midpoint percentile ranks.
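
Both versions of the percentile computation can be written out briefly. The arithmetic scores are an assumption (read off the chapter's frequency distribution), and the function names are illustrative.

```python
# Scores as read from the chapter's frequency distribution (an assumption).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]


def percentile_below(number_below, total):
    """Per cent of all scores that are lower than the individual's."""
    return 100 * number_below / total


def midpoint_percentile_rank(score, scores):
    """Per cent below the score plus half the per cent tied at it."""
    below = sum(1 for s in scores if s < score)
    tied = sum(1 for s in scores if s == score)
    return 100 * (below + 0.5 * tied) / len(scores)


# Fourth among 200 persons: 196 persons rank lower.
fourth_of_200 = percentile_below(196, 200)

# Midpoint percentile rank of a score of 14 on the arithmetic test.
midpoint_for_14 = midpoint_percentile_rank(14, arithmetic)

print(fourth_of_200, round(midpoint_for_14, 1))
```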

Percentile scores, like ranks, are useful in explaining test results to in- 
dividuals who have little background in the statistics of testing. However, 
they are awkward for other statistical manipulations that might be 
needed. Percentiles provide no indication of the distance between scores. 
There might be a wide gap in raw scores between the person who has
a percentile score of 70 and the person who has a percentile score of 71,
but this would not be seen from the percentile scores alone. Percentile
scores underemphasize the differences between scores on the extremes
of the distribution. Looking back at Figure 3-3, you can see that only
about 2 per cent of the cases fall between standard scores of 2.00 and
3.00. These standard scores would be equal to percentile scores of about
97 and 99. Although the scores appear to be separated by only a trivial
amount when expressed in terms of percentiles, we can see that in terms
of standard scores there is a wide separation.


Figure 3-4. Frequency distribution of percentile scores.

What is the distribution form of a set of percentile scores? An equal
per cent of the subjects make scores between the 99 and 90 percentiles
and between the 89 and 80 percentiles, and so on. Consequently, the
frequency distribution is flat, or rectangular as it is called (see Figure
3-4).


NORMS

Although standard scores or percentile scores would give an
indication of where Johnny stands in relation to his 10 classmates, there
would still be a question as to how well Johnny performs in relation to the
students in other schools. It may be that Johnny is in a particularly bright
class, in which average performance means high performance in
comparison to other classes. It might be that Johnny goes to school in a
"well-to-do" neighborhood where the children as a group are well above
the community average. To dramatize the specificity of scores in a
particular setting, we can imagine the problems involved in testing a class
composed of geniuses. Because of the mechanics of obtaining standard
scores and percentiles, half the students would inevitably come out
"below average." But, of course, below average performance for geniuses
would mean superior performance in a group of less gifted children.

Normative Populations. In order to understand how well a child
performs on a test, it is often necessary to compare his score with the scores
made by children in previous years and the scores made by children in
a range of localities. In order to interpret the test score made by an
applicant for a particular branch of the Armed Forces, it would be
necessary to compare his score with the scores of many other applicants. The
larger group to which an individual is compared is called the normative
population.

The collection of people which constitutes the normative population is
determined by the use to which the scores will be put. If achievement
test scores are being used to place students in programs in a particular
city, the normative population should consist of students from all of the
grammar schools in the city. If a selection test for the Armed Forces is
being used to select individuals for pilot training, the normative
population should contain the individuals who have previously applied
for pilot training. In some situations the normative population may be
quite small. To interpret the results of a course examination, the teacher
may need only to make comparisons with the scores made by several
previous classes.

Scoring in Respect to Norms. It is often more meaningful to compare
an individual's test score with the scores obtained from a normative
population rather than with some smaller and less representative group. In
this case, an individual's standard score would be determined by a
comparison with the mean and standard deviation of the normative
population. Similarly, percentile scores would be determined by the scores
found in the normative population. Instead of going through the labor of
actually computing such scores for each new individual, it is easier to
compose a table which allows a direct conversion of raw scores to
standard scores or percentile scores.

Sampling of Scores. The group of persons actually tested is usually
small in comparison with the total number of persons in the normative
population. Seldom will there be test scores for all of the individuals who
could conceivably be included. Even several thousand children is a small
group in comparison with the many millions in this country. The norms
obtained from a sample are then only estimates of the norms which
would be obtained from the entire normative population.

Sampling Bias. When choosing a sample for constructing norms, it is
important to select a group which is as representative as possible of the
normative population. The sample would be biased if all of the tests
were given at one school, because, as we said earlier, the particular
school might be above or below average for the community as a whole.

If the sample is meant to represent all of the children in the United
States, it must be ensured that there is a balance of children from all
sections of the country, that there is a proportionate representation of
urban and rural children, and that all races and ethnic groups are
included.

One way of ensuring a representative sample is to choose the subjects
randomly from the normative population. This might be feasible in
establishing norms for the performance of students in one city. It would
be possible to randomly select, say, 500 of the children as a sample. On
any larger scale it is not feasible to gather a completely random sample.
It is at least conceivable that a sample of the children in the United
States could be obtained by selecting randomly from all of the millions
of children. But this would be prohibitively expensive and time-consuming.
It might be necessary to go several hundred miles just to test
one subject who, because of the sampling procedure, was chosen from
a remote region. When, as is usually the case, it is not feasible to use a
completely random sample, there are alternative sampling methods
which can be applied. Some of these will be discussed in Chapter 13.

Sampling Error. Regardless of how representative the sample, results
obtained from a sample can only estimate the true norms in the popula- 
tion. If we find the mean and standard deviation of a test in a sample, 
it must be considered how much these estimates are likely to be in error. 
As in all sampling problems, the larger the number of cases (more items 
in the sampling of content and more persons in the sampling of scores) 
the less the estimates will be in error. 

The Sampling Distribution of the Mean. In an ideal situation, it is 
possible to determine the error that will be involved in making estimates 
from a sample. In this situation, it would be necessary to know in ad- 
vance the statistics, here the mean and standard deviation, which would 
be obtained in the entire population. This would be the case where tests 
were given to all of the subjects in a population. Samples of the tests 
could be drawn randomly just to see how much the mean and standard 
deviations of the samples would depart from the actual mean and stand- 
ard deviations which are known to exist in the population. We could draw 
out samples of different sizes to see how much more error is involved in 
the ones which have a relatively small number of people. 

If we drew 100 samples with 50 test scores in each, the sample means 
could be plotted in a frequency distribution and compared with the 
population mean. We would find a distribution resembling that in 
Figure 3-5. In Figure 3-5 we see that the distribution of means is like
the normal distribution. If we instead drew samples of 200 test scores
each, the sample means would again form a normal distribution, with the
mean of the sample means very close to the population mean. However,
because each sample now contains 200 instead of 50 persons, we would
expect to find the distribution of means clustered more tightly about the
population mean. Because of the greater stability obtained from a
sampling of 200 persons, it would be surprising to find sample means
which deviated by as much as 2 score points from the population mean.

Figure 3-5. Distribution of sample means about the population mean.

In the same way that a standard deviation can be computed for any
group of scores, a distribution of means can also provide a standard
deviation. We would expect the standard deviation of the means for the
samples of 50 persons to be larger than the standard deviation of the
means for the samples of 200 persons. In fact it would be about twice as
large.
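
The claim that the means of samples of 50 scatter about twice as widely as the means of samples of 200 can be explored by simulation. Everything specific in this sketch is an assumption: the population is taken to be normal with mean 25 and standard deviation 10, scores are drawn with a seeded random generator, and 2,000 samples of each size are drawn (many more than the 100 in the text) so that the estimate is stable.

```python
import random
from statistics import pstdev


def sd_of_sample_means(sample_size, n_samples, rng, mu=25.0, sigma=10.0):
    """Draw n_samples random samples of the given size from a normal
    population and return the standard deviation of their means."""
    means = []
    for _ in range(n_samples):
        total = sum(rng.gauss(mu, sigma) for _ in range(sample_size))
        means.append(total / sample_size)
    return pstdev(means)


rng = random.Random(0)  # fixed seed so the run is repeatable
sd_means_50 = sd_of_sample_means(50, 2000, rng)
sd_means_200 = sd_of_sample_means(200, 2000, rng)
ratio = sd_means_50 / sd_means_200

print(sd_means_50, sd_means_200, ratio)
```

The ratio near 2 reflects the standard result, not stated in the text, that the standard deviation of sample means varies inversely with the square root of the sample size: quadrupling the sample size halves it.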

If, as in our ideal example here, we knew the mean and standard
deviation of the population, we could determine what would be the
standard deviation of the means for a particular sample size, without going to
the trouble of drawing samples and computing means. The standard
deviation of the means, usually called the standard error of the mean,
would be very useful in specifying the accuracy of norms. For example,
if we know that the population mean is 25 and the standard error of the
mean is 1, then we can tell how rarely means of a particular size could
be obtained from samples. Working backward, if we know the standard
error of the mean for a particular sample size and know the mean of
one sample, we can state the likelihood with which the population mean
will differ from the sample mean by any number of score units.

Unfortunately, in nearly all practical work with norms, only the mean
and standard deviation can be obtained for one sample. There are
statistical techniques for working backward from sample values to the standard
error of the mean. A standard error can also be obtained for the standard
deviation, to determine the extent to which the standard deviation
obtained from a sample is likely to be divergent from the population
standard deviation. To explain these procedures in detail would take
us far afield from the central topics of tests and measurements. The
purpose here is to make the reader aware of the existence of sampling error
and its influence on norms. There are numerous books devoted primarily
to sampling error (see 1, 2, 3), where detailed procedures are given
for estimating the accuracy of norms.

Age Norms. We discussed previously how percentile scores and
standard scores could be used to express norms. In some situations it is
desirable to express norms in terms of children's ages. One such set of norms
could be obtained by testing the vocabulary of children at all ages from
four to twelve years. For this purpose, a list of words varying in
difficulty could be used. The mean score would be found for each
age group separately. A graphic plot of these might look like that in
Figure 3-6. Figure 3-6 shows, for example, that seven-year-old children
have a larger vocabulary than five- and six-year-old children, as would
be expected. Age norms would prove useful in interpreting the vocabulary
score made by a particular child. Suppose that he is seven years
old and has a vocabulary score of 25. The mean score for the
seven-year-olds in the normative sample is 30. We can then look to find the
age group which corresponds to a score of 25. Although there is no
age group whose mean falls exactly at this point, we can work as though
the curve were continuous throughout all age levels and determine the
fractional age group that would correspond to a particular score.
Reading from Figure 3-6, it can be seen that an age of approximately
6.5 would correspond to a score of 25. It then can be said that the
seven-year-old boy has the vocabulary of a six and one-half year old.

Figure 3-6. Mean vocabulary scores of children at each age from four to
twelve years.

Age norms fit in well with the way in which we customarily think about
the progress of children, and they are, therefore, very useful in the
interpretation of test scores.

Mental Age. Age norms are employed with some of the "general
intelligence" tests, such as the Stanford-Binet. By comparing a child's score
with age norms, it can be determined whether he is more or less
"intelligent" than the average child of his age. A score which is compared
in this way with age norms is referred to as a mental age. If a six-year-old
child performs as well as the average seven-year-old child, he is said
to have a mental age of seven. Similarly, the six-year-old who performs
only at the level of the average five-year-old would be said to have a
mental age of five.

The term mental age is somewhat misleading. It gives the impression
that the score which a child makes on an intelligence test is a direct
measure of total ability. Although tests of this kind have worked well
in practice, it is not reasonable to believe that they measure all that could
be considered intelligence. Even though there is a tendency for children
to score much the same on different intelligence tests, especially those
which depend heavily on the use and understanding of language, the
correspondence is far from perfect. Therefore, a child's mental age is
partly a function of the test which is employed.

Grade Norms. A set of norms which is very similar to age norms can 
be obtained by finding the mean scores for children in various grade 
levels. If a standard spelling test were given to children in the fourth 
through eighth grades, the means could be plotted by grades much as 
they were plotted by ages in Figure 3-6. Comparing an individual’s score 
with grade norms gives the teacher an indication of how well the pupil 
is progressing. 

Quotient Scores as Norms. By dividing an individual's mental age by
his chronological age, a score is obtained which is called an intelligence
quotient, or as it has come to be known, the IQ. If an eight-year-old child
does as well on an intelligence test as the average ten-year-old, then he
would have an IQ of 125. The formula for obtaining the IQ is then

\[ IQ = \frac{MA}{CA} \times 100 \]

where MA stands for mental age
      CA stands for chronological age
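
The quotient can be checked with a few lines of Python. This is only an
illustrative sketch; the function name ratio_iq is an assumption made
here and does not come from the text.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: mental age divided by chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return mental_age / chronological_age * 100


# The eight-year-old who performs like the average ten-year-old.
print(ratio_iq(10, 8))  # 125.0
```

The same function applied to a child of seven with a mental age of 6.5
gives an IQ a little under 93.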

In spite of the wide use and popularity of the IQ there are a number
of misleading connotations which it bears. Because the IQ is a ratio, a
ratio of MA to CA, it suggests that a ratio scale is available for
intelligence. Not only is it questionable whether or not the term
intelligence should be used with the available tests, but it is certainly
unwise to consider IQ scores as constituting a ratio scale. How unwise
this is becomes evident when we consider the IQ of a child who got none
of the items correct on the test. Such a child would certainly be
retarded, but it makes no sense to say that his intelligence is zero.

The IQ is misleading when applied to adults. It has been found that an
individual's score on an intelligence test grows until the late teens and
then levels off. If a twenty-year-old individual has an IQ of 150, this
cannot be interpreted to mean that he has a level of ability equal to that
of the average thirty-year-old.

The IQ is most correctly considered as a transformed standard score.
The mean IQ on the Stanford-Binet is approximately 100 (as it should
be if the test is properly standardized) and the average standard deviation
is approximately 16. This is the same distribution that would be obtained
by multiplying a set of standard scores by 16 and then adding
100 to each. However, if the IQ is to be construed as a transformed
standard score, it is necessary that the standard deviation be the same
for all tests which use such scores. This is unfortunately not the case.
In other words, if one person makes an IQ of 120 on Test A, and another
individual makes an IQ of 115 on Test B, it is entirely possible that the
latter individual has a higher standard score (due to a difference in
standard deviations of IQ's on the two tests).

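
The point about unequal standard deviations can be made concrete with a
short sketch. The standard deviations of 20 for Test A and 12 for Test B
are hypothetical values chosen only for illustration, as are the function
names; the text does not give figures for either test.

```python
def standard_score(iq: float, mean: float = 100.0, sd: float = 16.0) -> float:
    """Convert an IQ back to a standard score (z) for a test with this mean and SD."""
    return (iq - mean) / sd


def to_iq(z: float, mean: float = 100.0, sd: float = 16.0) -> float:
    """Transform a standard score into an IQ: multiply by the SD, then add the mean."""
    return mean + sd * z


z_a = standard_score(120, sd=20.0)  # Test A, hypothetical SD of 20
z_b = standard_score(115, sd=12.0)  # Test B, hypothetical SD of 12
print(z_a, z_b)  # 1.0 1.25
```

Under these assumed values the person with the IQ of 115 stands farther
above the mean of his test than the person with the IQ of 120 stands
above the mean of his.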
GRADING COURSE EXAMINATIONS


“Grading on the Curve.” There has developed an erroneous idea
that the characteristics of the normal distribution are such as to suggest
the grades which should be given on course examinations. One way to
do this is to give the people above 1.5 standard deviations, about 7 per
cent of the group, scores of A. This logic can be carried further to fail
the bottom 7 per cent, or some other per cent corresponding to a section
of the curve, and to allot the other grades in terms of the characteristics
of the normal distribution. There is nothing about the normal distribution
or any other statistical distribution to say what grades should be given.
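
The figure of about 7 per cent can be verified from the normal curve
itself. The sketch below uses the complementary error function from the
Python standard library; the function name proportion_above is an
assumption made for this illustration.

```python
import math


def proportion_above(z: float) -> float:
    """Proportion of a normal distribution lying more than z SDs above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))


print(round(proportion_above(1.5), 3))  # 0.067
```

The calculation shows only how large a section of the curve is; it says
nothing about which grade that section deserves.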

The reason that instructors sometimes “grade on the curve” is that it
simplifies the allotment of grades. The instructor need only allot grades
to the predetermined percentages of persons and the job is done. In
some situations this can be justified on the basis that the school, or
other institution, can retain only a portion of the entrants. It is then
necessary to fail a certain per cent of the students. Otherwise there is
little to recommend the procedure. This system of grading takes no
account of the over-all level of performance of the class. If one class
performs better than another, this will not affect the percentage of
individuals who make each grade. The instructor who does his job well,
who encourages all of the students to do their best, and who is especially
adept at getting over the subject matter can bring the level of the
majority of the students above what they would attain in a less fortunate
class setting. It is then unfair to “grade on the curve” and give only a
predetermined per cent of grades in each category.

“Grading on the curve” encourages an unhealthy set of attitudes among
students. It makes the student hope that some members of the class will
do poorly, so that his own grade will look good in comparison. It also
makes the superior student something of an outcast, because his superior
performance will make the remainder of the scores fall lower on the
“curve.” There is no substitute for the instructor's careful judgment about
the educational goals of his course and the levels of performance which
merit superior and inferior grades.

The Influence of Guessing. If a test employs a number of alternative
answers for each question, an individual will probably get some of the
questions correct even though he knows nothing about the subject matter.
If, on a true-false test, the individual were blindfolded and answered the
questions, or if he flipped a coin to determine the answers, he would
probably get about half of the items correct. The story is told of the
recruit for the Armed Forces who got none of the items correct on a long
multiple choice test. How could this happen? The most reasonable answer
is that he “cheated.” He knew all of the answers and marked incorrect
alternatives on purpose.

By making several assumptions, some formulas can be obtained to
correct for guessing. Let n stand for the number of alternatives on
each question. In a true-false test, n equals 2; if there are three
alternatives, n equals 3, and so on. Then, if an individual guesses on an
item, the probability, p, is 1/n that he will get the item correct. The
probability that an individual will get the item incorrect, q, is
(n - 1)/n. We see that p plus q equals 1. Then it follows that guessing
influences test scores as shown by the equation below:

\[ R = R_c + p(T - R_c) \]


where R_c stands for the number of items that he “actually” knows
      R stands for the number which he gets correct (knowledge plus
        guessing)
      T stands for the number of questions which he answers

What the formula assumes is that the score which an individual makes
is a function of the number which he actually knows, R_c, and the number
on which he guesses. An experimental approximation to R_c could be made
by using many alternatives for each item. As the number of alternatives
is made larger and larger, guessing has less and less effect on R.
Some manipulations of the formula will show how corrections for
guessing can be obtained.

\[ R = R_c + pT - pR_c \]
\[ R = pT + (1 - p)R_c \]
\[ R = pT + qR_c \]
\[ qR_c = R - pT \]
\[ R_c = \frac{R - pT}{q} \]



This will allow an estimate of the score that an individual would have 
made had guessing not been an influence; i.e., if there had been an infinite 
number of alternatives for each question. 

Some examples will help to show how the correction for guessing
works. Imagine that we are studying the scores made by students on a
40-item test, where 5 alternatives are being used for each item. Then p
would equal .2 and q would equal .8. Suppose that John tries 30 items
and gets 22 correct. Then the correction for guessing would be

\[ R_c = \frac{22 - .2 \times 30}{.8} = \frac{16}{.8} = 20 \]

If Bill tries all 40 items and gets 24 correct, the correction would be

\[ R_c = \frac{24 - .2 \times 40}{.8} = \frac{16}{.8} = 20 \]



Both of the boys “knew” twenty of the questions. John guessed at ten
of the items and got the expected percentage correct, achieving a score
of 22. Bill was more willing to guess. Because he guessed on 20 items,
he obtained a score of 24. The example illustrates the situation in which
the correction for guessing is most necessary, when the subjects are
allowed to attempt as many items as they choose. If a test is constructed
on that basis, the correction for guessing should be applied to the
scores. However, there is usually little reason for allowing subjects to
answer only as many questions as they choose. It is much easier to
instruct them to answer all questions whether they know the answers or
not. Then T becomes the same for all students; and if one student gets
more items correct than another, he will maintain a higher score after
the correction is made.
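
The correction can be written as a small function and checked against
the two examples. This is a sketch only; the function name
corrected_score and its argument names are assumptions introduced here,
and exact fractions are used simply to avoid rounding in p and q.

```python
from fractions import Fraction


def corrected_score(right: int, attempted: int, alternatives: int) -> float:
    """Estimate the number of items actually known: (R - pT) / q."""
    if alternatives < 2:
        raise ValueError("each item needs at least two alternatives")
    p = Fraction(1, alternatives)  # chance of guessing an item correctly
    q = 1 - p                      # chance of guessing an item incorrectly
    return float((right - p * attempted) / q)


print(corrected_score(right=22, attempted=30, alternatives=5))  # John: 20.0
print(corrected_score(right=24, attempted=40, alternatives=5))  # Bill: 20.0
```

When every student attempts every item, T is a constant, and the
corrected scores fall in the same order as the uncorrected ones.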


It should be remembered that the correction for guessing is only an
estimate based on the expected effect of guessing. If an individual flipped
a coin to decide the answers on a true-false test, the expectation is that
he will get 50 per cent correct. However, this would be so only on the
average. Luck might have it that the individual would get all of the items
correct or none of them correct. The effect of guessing is then not only
to distort the scores, overestimating the scores of the persons who know
the least, but it introduces a certain amount of randomness, or “error,”
into the scores. This and other sources of error in test scores will be
discussed in Chapter 6.
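
How much luck can enter may be sketched by treating each guess as an
independent trial, which is an assumption added here (the binomial model)
and not something the text states. The function name guessing_spread is
likewise hypothetical.

```python
import math


def guessing_spread(items: int, alternatives: int):
    """Expected score, standard deviation, and extreme-outcome chances for pure guessing.

    Assumes every item is guessed independently with probability 1/alternatives.
    """
    p = 1 / alternatives
    q = 1 - p
    expected = items * p           # average number correct by chance
    sd = math.sqrt(items * p * q)  # spread around that average
    all_correct = p ** items       # chance of guessing every item right
    none_correct = q ** items      # chance of guessing every item wrong
    return expected, sd, all_correct, none_correct


expected, sd, all_correct, none_correct = guessing_spread(items=10, alternatives=2)
print(expected)      # 5.0
print(round(sd, 2))  # 1.58
print(all_correct)   # 0.0009765625
```

On a ten-item true-false test, then, the coin flipper averages five
correct, but scores one or two items away from five are common, and a
perfect score by luck alone occurs about once in a thousand attempts.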
CHAPTER 4 


The Evaluation of Psychological Measures 


In these times, it would be difficult to find someone who has not had 
first-hand experience with psychological measures. Course examinations 
in school are examples of psychological measures. The aptitude tests 
which you might have taken in applying for college, for a branch of the 
Armed Forces, or for a particular job are psychological measures. You 
might have taken personality inventories which classify you as an “extra- 
vert” or “introvert,” or seen ink blot and other less direct personality tests. 
You have probably taken sensory tests of various kinds, for example, the
test for acuity of vision that is required in many states for a driver's
license. Perhaps you have participated in psychological experiments in
which you were required to place pegs in a peg board, react as quickly
as possible to a light signal, or solve a problem in the shortest possible
time. These are all within the area of psychological measurement. 

Testing the Test. Tests should be held suspect until their worth is 
proved. Tests that purport to measure intelligence, anxiety, and social 
adjustment may not measure those characteristics at all. If there are a 
number of ways in which a measure can be obtained, a method is needed 
for deciding the relative merits of the different approaches. This takes 
us into the subject of test validation. 


Psychological measures differ markedly in the extent to which their 
usefulness is self-evident. On one extreme, tests of auditory acuity have 
a relatively direct meaning and importance. If the tests indicate that a 
person has very low acuity, he would probably fail in a job that required 
good hearing. Also, it is a safe bet that this person would have difficulty 
in receiving phone messages, in working as a telegrapher, and in trying 
to understand radio programs. 

At the opposite extreme, where the meaning of the test results is not
at all self-evident, are some of the personality tests. What does it mean
when an individual describes an ink blot as looking like a butterfly? In
this case it is necessary to show a connection between the test response
and what the individual does in daily life. It is necessary to show that
the test predicts something, in the broad sense of the word.

Starting with the distinction between measures that stand for themselves
and those which must prove their ability to predict, the outline
in Table 4-1 shows how different kinds of psychological measures are
constructed and appraised.


TABLE 4-1. OUTLINE OF VALIDATION PROCEDURES FOR THE USE OF PREDICTORS
AND ASSESSMENTS

                      Predictors                   Assessments

Purpose               Prediction of behavior       Measurement of behavior

Standard of validity  Demonstration of             Demonstration of
                      predictive power             representativeness

Validation procedure  Correlation with an          Sampling of content
                      assessment (a criterion)

Construction          Hypothesis; collection of    Definition:
                      items; finding the best      a. Actions to be measured
                      items                        b. Persons who stand high
                                                      and low in the measure

Examples              Aptitude tests               Achievement tests
                      Personality tests            School examinations
                      General intelligence tests   Job ratings
                      Special ability tests:       Work records
                      musical aptitude, art        Work samples
                      judgment, manual             All criteria for
                      dexterity, etc.              evaluating predictors

                                   Attitude tests
                                   Interest tests
                                   Opinion surveys


The remainder of the book will be largely devoted to filling in the outline
in detail.


EVALUATION OF ASSESSMENTS


Assessments ¹ are measures in their own right. If the instructor in a
particular course gives you a final grade of A, then you have been
assessed, or measured, in your course achievement. When an employer
promotes one of his salesmen, the salesman has been assessed in regard
to his performance. When the manager of a baseball team retains some
of his players and sends others to a minor league, the players have been
assessed by the manager. In each of these cases, the measure of
performance is determined by someone who has the right, or the
responsibility, to decide what is good and bad in the particular
situation. The course instructor decides that a test of a particular form
will best measure what he wants to be measured. Perhaps the salesman is
promoted because he sells more articles than other salesmen. The baseball
players might be retained in terms of a statistical complex of batting
average, number of errors, runs batted in, or whatever else the manager
chooses to consider. An assessment has an immediate importance, at least
to the person who is being assessed and who stands to gain or lose from
the outcome.

¹ The reader should be careful to distinguish the meaning of the term
“assessment” as it is used here from the misleading use of the term that
often appears in the psychological literature. For example, the term
“personality assessment” is often used when the instruments themselves
are really predictor tests. Some psychologists use the term “content
validity” to refer to the procedures which are described in this chapter
for the validation of assessments.

The Standards for an Assessment. We might disagree with the way in
which particular assessments are made, arguing, in terms of our previous
examples, that there are better ways to construct course examinations,
better ways in which to decide on the promotion of salesmen, and better
ways to determine the effectiveness of baseball players. What evidence
could we use in our arguments? The most pertinent point to make in
criticizing an assessment is to question the extent to which the assessment
represents important types of performance.

The course examination might be criticized by examining the content
of the questions and comparing these with the subject matter of the
course. If the examination is in American history, it might be possible to
point out that no questions are included on the effect of the mechanical
revolution on American culture, or other omissions of this kind might be
noted. Another point of criticism would be to argue that certain of the
examination questions concern issues which are less important than
others that might be included. The standard that an assessment should
meet is that of content representativeness for the achievement which it
seeks to measure.

Similar arguments could be made about the representativeness of the
yardsticks used in assessing the salesmen and the baseball players. We
might argue with the sales manager that there are other things to consider
in addition to the number of sales made. It might be pointed out that it
helps the company more when sales are made of particular new products,
ones that the company wants to introduce to the public. Another argument
might be that some salesmen build more good will in their work,
perhaps forsaking some sales to gain the long-range trust of the client.
We might argue with the baseball manager that he fails to include some
important types of performance in his assessments, that factors such as
how well players stand up in “close” games, how many errors a player
causes for the other team, and how much the player contributes to team
morale should also be considered.

The Sampling of Content. Using the word “content” to stand broadly
for all of the possible things that could be taken into consideration in
making an assessment, it is apparent that the content of an assessment
is only a sample of the total content which could be considered. It is a
sample in the sense that the number of items or points of consideration
in the assessment is very small in comparison with the many that are
involved in most complex performance situations. The history examination
conceivably could include every fact stated in the textbook, plus those
given in the class lectures, plus the many questions that could be asked
about the relationships between the facts.

Although the course examination is a sample of the total course
material, it is seldom a randomly chosen set of questions. If the purpose
were to select randomly chosen bits of the things that met the student’s
eye during the semester, it would be sensible to use questions like, “What
is the middle initial of the author of the text?” “Where was the text
printed?” “Whose picture is on page 75?” Of course, materials of this
kind would not be likely to find their way into an examination, and the
reason is that the sampling is restricted to items that are judged to be
important by the instructor. We will talk at greater length about the
sampling of content when classroom examinations are considered in
Chapter 8.


Defining an Assessment. The sampling of content for an assessment
inevitably involves human judgment and human values. That is why the
outline in Table 4-1 states that the construction of an assessment begins
with a definition, a definition of what is desirable in a particular
performance setting. Sometimes the valuing is done by one person only, and
it can be done arbitrarily if he chooses. The business executive who takes
the responsibility of hiring and firing, promoting and demoting, may use
only his own impression of the individual's achievement. In this case,
the assessment is defined in terms of the personal likes and dislikes of
one individual.

In other situations, the assessment is defined cooperatively by a number
of persons. Some of the school achievement tests illustrate this. The
test might be the joint product of, say, a number of history teachers, who
try to come to an agreement about the important course material. The
definition, or the statement of what is valued, might be explicitly set out
in outline form and rules might be established as to how the examination
questions should be presented. Regardless of how explicitly the assessment
is defined or the extent to which the assessment is a joint effort,
the assessment stands as a definition of what is “good” or what is valued
in a particular situation.

An assessment can be defined most explicitly if it is possible to state
the actions, or behaviors, which constitute successful performance. The
several illustrative assessment situations which we have spoken of so far
were defined in terms of actions. For the salesman, successful actions
consist of selling products. Success for the history student depends on
answering the examination questions correctly. The successful baseball
player performs the actions of making hits, stealing bases, and making
put-outs. When assessments are defined in terms of actions, it is relatively
easy for people to examine and discuss the assessment content.

In many assessment situations, the actions which make for successful
performance are difficult to state. What actions are involved in being
a successful floorwalker in a department store, naval captain, or chef?
Members of these professions are certainly assessed in terms of
performance and some are given positions of more and some of less
responsibility. Occupations such as these are difficult to define in terms
of concrete actions, and there are often legitimate reasons for disagreeing
about who is successful and who is unsuccessful in each. It would be hard
to defend the salesman who never sold an article; but even though you
might find the products of a particular chef to be less than desirable,
there may be others who relish his fare.

If the assessment cannot be explicitly defined in terms of actions, then
it must at least be definable in terms of persons, in terms of those who
are relatively successful and unsuccessful. We would want to know
whether the people who judge the effectiveness of naval captains agree
about the assessment of each of the captains. If the assessment rests with
a group of higher-ranking officers, each of them could be asked to rate
or rank-order 50 naval captains as to their performance. If the separate
ratings agree highly, if Captain Smith is rated high by all of the judges
and Captain Green is rated as relatively ineffective in his duties, and so
on for the captains in between, it can be said that there is a reliable
assessment being made. We would then know that the same or very
similar actions are probably being considered by each of the judges, even
though the judges are not capable of stating what these actions are.
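
One way to express how highly two judges agree is a rank correlation.
The text does not name a statistic, so the use of Spearman's formula
below is an assumption, as are the function name and the rankings of
five captains, which are made-up illustrative data.

```python
def rank_agreement(ranks_a, ranks_b):
    """Spearman rank correlation between two rank-orderings without ties."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both judges must rank the same persons")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("at least two persons must be ranked")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical rankings of the same five captains by two judges.
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 3, 5, 4]
print(rank_agreement(judge_1, judge_2))  # 0.8
```

A value near 1 indicates that the judges order the captains in much the
same way; a value near zero indicates that they do not.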

Criterion Research. If the ordering of people on the assessment proves
to be reliable, there is something that the measurement specialist can do
to make the assessment more explicit. He can perform research to learn
some of the actions which constitute successful performance. The
behavior of 50 successful naval captains could be studied and compared
with that of 50 captains who had been judged to be relatively unsuccessful
in their duties. Studies could be made of the way in which the successful
captains organize their commands, the kinds of responsibility which
they designate to particular officers, the systems of communications which
they set in operation for handling emergencies, and so on for the many
places in which successful officers might differentiate themselves from
unsuccessful officers.


The set of “critical behaviors” or “critical incidents” which are found
could be used in an assessment rating scale, like the following:

                                                          Yes    No
Does the officer supervise the ship's navigation
rather than leave it to other officers?                   ___    ___
Does the officer instruct his men about their
behavior when off the ship?                               ___    ___
Does the officer try to keep informed on new
technical developments?                                   ___    ___
Does the officer discipline his subordinates in the
company of crew members?                                  ___    ___
Does the officer consider the grades made by the
men in specialized courses when recommending
promotions?                                               ___    ___

A longer list of actions or behaviors of this kind could be used by judges
in making assessments. Although such a list of critical behaviors would
not be completely satisfactory to all of the judges nor would there be
assurance that important actions were not being overlooked, the list
would be at least a start in making the assessment more explicit.


EVALUATING A PREDICTOR 


The score that an individual makes on a predictor test has no necessary
importance in its own right. Consider the following type of test material.

    Circle every group of dots that has exactly four dots.
    Do as many as you can in ten seconds.

Although scores on a test of this kind would have no obvious meaning,
it can be shown that they would predict performance on certain clerical
jobs. Employers would, of course, have little interest in the ability of
prospective employees to circle groups of dots. Their purpose is to select
applicants who will later prove to be good workers. The purpose of tests
of this kind is to predict, or forecast, how well individuals will
perform, or how highly they will be assessed, in important situations. The
assessment which a particular test is intended to predict is called the
validation criterion.

It should be understood that the term “prediction” ² is being used in
the broad sense to encompass functional relationships of all kinds. A test
could conceivably be used to “predict” what people have done in the past,
in the same way that the detective gathers clues to “predict” who
committed a crime. Also, tests could be used to “predict” how people
behaved when they were children. Tests are often used to “predict” a
current condition rather than to predict in the stricter sense, or to
forecast future behavior. A typical example is a test for brain damage. The
purpose is to “predict” the present condition of brain damage, and the
proof of the “prediction” would be borne out by detailed physiological
investigations.

The Statistics of Prediction. No test is a perfect predictor and, as we
shall see later, nearly all predictors entail more error than accuracy in
forecasting performance. The relative accuracy of predictor tests is
determined from statistical analyses: statistical descriptions of the
number of people who are properly and improperly classified, statistical
estimates of the per cent of persons who can be safely forecast at
particular levels on the assessment, and statistical estimates of the
amounts of money that can be saved by using a predictor. Because of the
need for statistical measures in determining the utility of predictor
tests, the next several chapters will concern the statistical models which
are most often used in testing.
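
The first of these statistical descriptions, the count of persons who
are properly and improperly classified, can be sketched as follows. The
cutoff score, the test scores, and the record of who later succeeded are
all hypothetical values invented for this illustration, as are the
function name and the four category labels.

```python
def classification_counts(scores, succeeded, cutoff):
    """Tally how a cutoff on a predictor test classifies persons against the criterion."""
    counts = {"hit": 0, "false_positive": 0, "miss": 0, "correct_rejection": 0}
    for score, did_succeed in zip(scores, succeeded):
        selected = score >= cutoff
        if selected and did_succeed:
            counts["hit"] += 1                # selected, later successful
        elif selected and not did_succeed:
            counts["false_positive"] += 1     # selected, later unsuccessful
        elif not selected and did_succeed:
            counts["miss"] += 1               # rejected, would have succeeded
        else:
            counts["correct_rejection"] += 1  # rejected, would have failed
    return counts


# Hypothetical predictor scores and later assessments for six applicants.
scores = [12, 18, 25, 30, 9, 22]
succeeded = [False, True, True, True, False, False]
print(classification_counts(scores, succeeded, cutoff=20))
# {'hit': 2, 'false_positive': 1, 'miss': 1, 'correct_rejection': 2}
```

In this made-up group, four of the six persons are properly classified
and two are improperly classified.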

Construction of a Predictor. A predictor test begins with a hypothesis,
a hypothesis that test materials of a particular kind will predict a
particular assessment. We might hypothesize that numerical problems will
predict how well individuals will perform in mechanical engineering
school. The task would then be to assemble a large number of numerical
problems such as:

    (a)  268        (b)  14 X 64 = ———        (c)  95,842
          32

A large collection of items of this kind would be called an item pool. The
problem is then to select from the item pool the minimum number of
items which afford the best prediction of an assessment. This is done by
statistical procedures of item analysis, which will be discussed in Chapter
8.

Examples of Predictors. Most of the measures which psychologists
construct are predictors. There are very few places in which the
psychologist has the responsibility of deciding what is good and bad in
performances, the exception being ... to his own students. All the aptitude
tests are predictors; they are designed to forecast how well people will
perform as salesmen, ... captains, artists, etc. All personality tests are
predictors; they are used to predict the likelihood that a person will have
a “nervous breakdown,” enter a mental hospital, or how he will react in
emergencies, how effective he will be in positions of leadership, and how
successful he is likely to be in marriage.

² Different terms are often used to refer to functional relationships with
past, current, and future events. Functional relationships with past events
are sometimes termed “postdiction.” Functional relationships with current
statuses are often called “diagnosis.” Whereas the proper meaning of the
word “predict” is to forecast future behavior, it will be used here to
refer to all three kinds of functional relations.

The so-called intelligence tests were initially instruments of the
predictor type. Although some have tried to argue that these tests are
measures, i.e., assessments, of intelligence, the domain of all possible
actions that could be called intelligent behavior is not agreed on by
different investigators, and the domain is too large to permit an adequate
sampling.

Whether an instrument is classified as an assessment or as a predictor
often depends on the way in which it is used. On a test of attitudes
toward Eskimos, for example, an individual answers “yes” to questions
such as, “Do you think that Eskimos are as intelligent as other people?”
“Would you be willing to have an Eskimo as a personal friend?” “Would
you be willing to make a donation for helping Eskimos?” It could then be
said that the individual responds as though he is favorable to Eskimos.
Consequently, the instrument could be considered an assessment of
verbal report. However, there is no guarantee that the individual would
actually behave in a friendly manner toward Eskimos, that he would
treat them as intelligent persons, make friends with them, or donate
money for their aid. Therefore, the instrument is being used as a predictor
of how people will behave.

An interest test can demonstrate what an individual says that he likes
to do, but the verbal report alone gives no assurance that the individual
will actually seek out the things which he professes to enjoy. The
individual who says that he likes classical music may never attend a
concert and may habitually switch the radio from a presentation of
Beethoven to other forms of musical fare. Similarly with the responses
given in an opinion survey, it is not necessarily so that an individual
will behave as though he holds a stated conviction.

When an instrument tries to determine only an individual's verbal
response, it can be said that an assessment is being made. Any effort to
use the response as an indication of how an individual behaves in other
situations or how he will react in the future means that the instrument is
being used as a predictor.

Numerous studies have found relationships between what people say
and what they do later, but the important point here is that it is
necessary to prove such relationships rather than take them for granted.
Also, what people say they like, think, and feel is an important type of
information regardless of whether it predicts or is used to predict
subsequent actions.

When Not to Construct Predictor Tests. Predictor tests have proved 
themselves useful in countless practical situations. They have been par- 
ticularly useful in the selection and assignment of people in schools, in 
the Armed Forces, in government agencies, and more recently in industry. 
Care must be taken that the demand for new tests does not bring forth 
poorly standardized and improperly evaluated instruments. 

The test constructor must be cautious at the outset not to launch a 
program of research which has little chance of ending in success. Before 
the construction of predictor tests is undertaken, the nature of the assess- 
ment, or criterion, should be examined. It is most fortunate if the assess- 
ment is defined in terms of easily observed actions. If the assessment 
cannot be specified in terms of actions, it is essential that the assessment 
at least be expressed in terms of the persons who stand high and who 
stand low in the particular performance. It is very important in this 
latter case to make sure that the ratings of persons are reliable. If neither 
the actions can be specified nor the people be reliably rated, it is a waste 
of time to work on the construction of predictor tests. There will be no 
way to evaluate the tests which are constructed. In some situations, the 
best service that the test constructor can perform is to tell the people 
who want predictor tests that it would be futile to begin such work
until more systematic assessments are developed.


CONSTRUCT VALIDITY 
The distinction between predictors and assessments will serve to judge
many, if not most, of the instruments in psychology. If a measure can
be unambiguously categorized as either a predictor or an assessment, the
rules for constructing and validating the measure are clear.

There is a third category of measures that does not fit completely into
either of the two basic divisions. Measures of this kind are encountered
most often in psychological experiments. In a typical experiment, the
purpose is to test the hypothesis that difficult learning tasks raise anxiety.
In order to test the hypothesis, it is necessary to measure anxiety. The
measure of anxiety cannot be regarded as a predictor, because the
purpose is not to predict some other action but to measure anxiety then and
there. Neither can the measure be an assessment, because there is no
agreed-on set of “anxiety actions,” or content, to be sampled.

Construct validation consists of defining a measure in terms of
numerous research findings. This is essentially the way in which intelligence
tests have gained meaning. Enough research has been done with these
instruments to know how the underlying function grows with age
and how intelligence test scores relate to numerous other variables. The
gradual defining of an instrument in terms of what it does is an
approach to validating some psychological measures. It is by no means
an easy course, and psychologists are not entirely in agreement as to
what constitutes construct validity. There is also a danger that
instruments which should be considered as either predictors or assessments and
validated as such will hide under the cloak of construct validity.
Construct validity should be considered a last resort, when the instrument is
not specifically intended to predict performance or when the instrument
cannot stand alone as a self-evident measure of what is intended.
However, there are some instruments which can be judged only by
accumulated research experience.


OTHER LOGICAL CONCEPTS 


The logic which has been presented here for the evaluation of psycho- 
logical measures is not one that would be agreed on by everyone in the 
field of testing. There are numerous other logical classifications which 
have been used in explaining the usefulness of measures. Some of these 
constructs will be listed and compared with the logical scheme de- 
scribed above. 

Face Validity. Face validity refers to whether or not the test content
“looks” as if it measures what is meant to be measured. This is
certainly one of the important properties of an assessment, such as a
school examination; but this standard of validity has often been applied
to predictor tests. Personality inventories have sometimes been
claimed to be valid because they have face validity. This is a poor
way to evaluate a predictor. Indeed, with personality tests, the more
the test items bear an obvious relationship to what is being predicted,
the more easily the test can be “faked” by the subjects.

Predictor tests should not be judged in advance because they do or do
not have face validity. Sometimes the most unlikely looking test will
prove to be an adequate predictor. We said earlier that a predictor test
represents a hypothesis. One of the prime rules in science is never to
exclude a hypothesis before it has been tested. There are numerous cases
in which scientific progress was held up because of a refusal on the part
of some to let a hypothesis be tested. Some of these were the germ theory
of disease, the heliocentric theory of the solar system, and the law of
falling bodies. Predictor tests can only be judged empirically, by the
strength of their relationships with assessments.

Correlations between Predictors. A not uncommon practice has been
to correlate a new predictor test with a better established predictor test.
If the correlation is high, this is taken as evidence that the new test is
“valid.” The difficulty with this procedure is that a new test can correlate
highly with an older test but not correlate at all with the assessment it
is meant to predict. For example, the older test might correlate .60 with
a particular assessment. The new test might be found to correlate as high
as .70 with the older test. It is possible in this case for the new test to
correlate zero with the assessment. It is always much more to the point
to correlate a predictor directly with the thing it is meant to forecast.
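
The claim in this example can be checked with a little arithmetic. The sketch below rests on a standard fact that is not stated in the text: once two of the three correlations among three variables are fixed, the third can lie anywhere in the interval given by the product of the two, plus or minus the square root of the product of their "unexplained" portions. The function and variable names are illustrative only.

```python
import math


def feasible_range(r_new_old, r_old_assessment):
    """Return the lowest and highest correlations between the new test and
    the assessment that are consistent with the two given correlations."""
    center = r_new_old * r_old_assessment
    half_width = math.sqrt((1 - r_new_old ** 2) * (1 - r_old_assessment ** 2))
    return center - half_width, center + half_width


# Figures from the example: new test vs. older test = .70,
# older test vs. assessment = .60.
low, high = feasible_range(0.70, 0.60)

# If zero falls inside the interval, a zero correlation between the new
# test and the assessment is entirely possible.
zero_is_possible = low <= 0.0 <= high

print(round(low, 2), round(high, 2), zero_is_possible)
```

With these figures the interval runs from slightly below zero to nearly 1.00, so the correlation of .70 with the older test tells us very little about how well the new test forecasts the assessment.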

There are only two cases in which correlating one test with another 
will provide definite information. If the correlation between the two tests 
is nearly perfect, close to .90, then the two tests are almost identical and
should be approximately equal in predictive effectiveness for any assess- 
ment. On the other extreme, if the correlation between the two tests is 
very low, approaching zero correlation, it is certain that the two tests 
are measuring different things. 

We can get some information from correlating one assessment with 
another of its kind. For example, we might try out two achievement 
tests which are both intended to assess the student’s knowledge of history. 
If scores on the two tests do not correspond, if the correlation is not high, 
we would wonder about the differences in content which made one 
measure different from the other. This might lead to a better appraisal 
of the subject matter, a more explicit outline of the major topics, and 
eventually to a more comprehensive achievement test. However, the 
observation that two assessments correlate is not necessarily an indication 
that the assessments are "good." For example, one instructor could meas- 
ure course achievement in history by seeing how far the students could 
throw the textbook. Another instructor could use as a test how far the 
students could throw their notebooks. The two measures would correlate 
highly, but neither would have anything to do with history.

Factorial Validity. A primary concern of measurement specialists is 
to find tests that measure some common function and thus define what 
is called a factor. Factor analysis is the statistical procedure for isolating 
these common functions and will be discussed in Chapter 9. It has been 
found, for example, that tests like reading comprehension, vocabulary,
and word problems have in common a factor of verbal comprehension.
Other groups of tests define reasoning, memory, and perceptual factors.
Factor-analysis studies help us to understand the nature of human
attributes and provide a very useful classification scheme for available tests.

If a test is described as a measure of verbal comprehension, some proof
should be offered that the classification is correct. The proof consists of
correlating the test with the factor which it is intended to measure. It is
sometimes found in this way that a test which is intended to measure a
particular factor is in fact more related to some other factor. Although it
is quite useful to learn the factor composition of tests, the
term “factorial validity” should be accepted with some caution. Even
if a test has high factorial validity, it may be invalid as a predictor of
particular criteria. Factorial validity indicates only that a test is properly
classified but not necessarily that it is good or useful for any purpose.


RELATIONSHIPS AMONG VALIDITY CONCEPTS


Although most measurement specialists recognize predictive validity
and content validity, there is some controversy as to how these concepts
apply to currently available tests. For example, there are some who claim
that intelligence tests should be judged by content representativeness,
or content validity (see 4, chap. 7). The point of view given here is that
intelligence tests must be validated by standards of predictive and
construct validity rather than by standards of content sampling. Whereas the
point advocated here is that most tests of personality should be regarded
as predictor tests and validated as such, there are those who argue for
face validity. (See 1, 2, 3, 4, 5 for alternative points of view about test
validation.)

Some measurement specialists regard factor analysis as an end in itself
and factorial validity as the key standard for test validation. The point
of view advocated here is that factor analysis is very useful in classifying
and exploring the nature of tests, but that factor analysis should be
considered only the first step in determining the practical usefulness of
particular instruments.

Even if a test can be classified clearly as a predictor or as an
assessment, this does not mean that the standards and procedures that apply
to other kinds of measures are not useful. For example, some of the
validation procedures that apply to predictor tests are useful with
assessments, such as with achievement tests. Although achievement tests, and
all assessments, must ultimately be judged by standards of content
representativeness, it is helpful to learn whether the tests correlate well with
other achievement tests and whether the tests predict later achievement.

Factor analysis is useful in exploring the content of achievement test
batteries. For example, a factor-analysis study of achievement tests might
show that a test which is intended to measure general knowledge of the
social sciences is overly slanted toward economics or toward some other
special field. This information would help the investigators to revise the
test in such a way as to include a broader coverage of social science
material.

Whereas a predictor test must ultimately be judged by its ability to
predict (in the broad sense of the word) important behavior outside the
testing situation, other standards are useful. An effort to understand the
content of predictor tests and to hypothesize what the tests measure
are natural ways of exploring the nature of human abilities. As was
said previously, factor analysis plays an important part in helping us
understand the nature of human abilities as it is reflected in
predictor-type tests. Also, it is often helpful to correlate one predictor test with a
number of others in order to get some hunches regarding practical uses
for an instrument.

Construct validation requires not only some of the procedures used
with assessments and predictors but also some other procedures as well.
The major approach is to determine whether the experimental evidence
obtained from applying an instrument fits the "construct," or the theory,
for which the instrument was designed. For example, in the construct 
validation of an instrument to measure “anxiety,” it is important to deter- 
mine whether or not the instrument gives results that are in compliance 
with a theory about anxiety. If the theory holds that anxiety increases in 
particular kinds of frustrating situations, an experiment can be under- 
taken to determine whether people who are placed in the situation obtain 
higher anxiety scores than people who are not placed in the situation. 
Many such experiments will help to determine whether an instrument 
measures a particular construct, such as the construct of anxiety in the 
example.

It is primarily important to keep in mind that an instrument is validated
for particular uses rather than in a general sense. For example, the test
which predicts one job performance well is not necessarily a good
predictor in general. In fact, it may have no validity at all for predicting
other kinds of vocational accomplishment. A mathematics examination
may have good content representativeness for a particular unit of
instruction, but the test will not necessarily be good for other uses. If, as in the
example above, a test proves to have construct validity for the
measurement of anxiety, it is not necessarily the case that the test is
predictive of marriage happiness or any other behavior that might be
predicted. The proper aim, therefore, is to try to demonstrate that an
instrument serves some function, or set of functions, rather than to try to
prove that an instrument is generally "good."

SUMMARY 


There is still some controversy as to how psychological measurements
should be validated. Although they may not use this particular
nomenclature, most psychologists recognize the categories of "predictors" and
“assessments.” If an instrument is used as either a predictor or an
assessment, the rules for validation are relatively clear. Primarily, a predictor
must demonstrate functional relations with important behaviors outside
the testing situation, and an assessment must rest on sound methods of
content sampling. In addition to the primary validation procedures for
predictors and assessments, some auxiliary empirical and statistical
procedures are useful.

Some instruments can neither be classified entirely as predictors nor as
assessments. These types of instruments are often used in psychological
experiments. If the instruments measure what they are purported to
measure, they should demonstrate construct validity. There is
considerable controversy about construct validity: whether it is a meaningful
concept and, if so, how it is to be studied. Construct validation is
essentially a “bootstrap” operation in which an instrument that initially has
little meaning in its own right gains meaning through continued use and
research.


CHAPTER 5 


Predictive Efficiency of Tests 


Predictor tests are used to improve the “bets” that are made about how
individuals will perform in particular situations. As background for
understanding how tests are used, let us first look at a situation in which there
is no test available for making predictions. Imagine that we are trying to
predict the grade averages of graduating college students. The grade
averages are at hand, the mean and standard deviation are known, but
there is no indication as to which student made which grade average.
Also, we do not know any of the students personally and have no
information about their individual capabilities. Imagine then that someone
calls off the name of each student in turn and you are required to guess
the grade average of each. With this complete lack of information about
the ability of the individual students, what would you guess for Bill Smith,
Mary Jones, Tom Wilson, and each student as his name is read off?

Would the predictions be most accurate by betting that the whole
group made low scores, that some made high and others low scores, or
should the bets be made on some random basis? You would lose by
choosing any of these courses. The proper course would be to bet that all
of the students made exactly the same grade, the mean. The mean is the
point about which the sum of squared deviations is a minimum; and
without a knowledge of the ability of individual pupils, the over-all
amount of error will be least by predicting that all students make average
grades.

Why should you make a series of bets that are sure to be largely in
error? Only a minority of the grades will be exactly at the mean, and the
standard deviation may indicate that some of the grades vary considerably
from that point. The reason is that, in spite of some large errors in
prediction that are bound to occur, the over-all amount of error in
prediction would be even larger if any system of betting were employed other
than betting at the mean for all students. We know this is so because of
mathematical proofs, and we must learn to accept mathematical proofs
even if they do not correspond with common-sense expectations.
Procedures of statistical estimation are always on the conservative side,
allowing bets to venture away from the mean as more and more information is
obtained about particular persons. Only in the case of a perfect
test would the bets vary as widely as the actual assessment scores.
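
The advantage of betting at the mean can be seen numerically. The following is a minimal sketch using a made-up set of nine grade averages (the numbers are hypothetical and appear nowhere in the text); it compares the sum of squared errors obtained by guessing the mean for everyone with the sums obtained by guessing some other single value for everyone.

```python
def sum_squared_error(grades, guess):
    """Total squared error when the same guess is made for every student."""
    return sum((g - guess) ** 2 for g in grades)


# Hypothetical grade averages for nine students (illustration only).
grades = [1.8, 2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 3.6, 3.9]
mean = sum(grades) / len(grades)

error_at_mean = sum_squared_error(grades, mean)

# Bets placed a little or a lot below and above the mean.
candidates = [mean - 0.5, mean - 0.1, mean + 0.1, mean + 0.5]
errors_elsewhere = [sum_squared_error(grades, c) for c in candidates]

print("error when betting the mean:", round(error_at_mean, 3))
for c, e in zip(candidates, errors_elsewhere):
    print("error when betting", round(c, 2), ":", round(e, 3))
```

Every alternative bet gives a larger total than the bet at the mean, and the penalty grows as the bet moves farther from the mean.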

Statistical Methods in Scientific Activity. Perhaps it would be good to
consider for a moment why there is need for statistical procedures in
scientific enterprise. In classical mathematics, the kind that you
encountered in introductory algebra, simultaneous equations like the following
are often found:


          X     Y
    a.    2     1
    b.    4     2
    c.    8     4
    d.   10     5


Each pair of numbers above forms an equation. Equation a says that
when X is 2, Y is 1, or, in other words, that X is twice as large as Y.
Equation b says that when X is 4, Y is 2, which shows again that X is twice as
large as Y. Equations c and d also affirm that X is twice as large as Y.
The same relationship would be found by adding the four equations
together. When a number of equations all supply the same information,
they are said to be equivalent.

One of the exercises in elementary algebra is to make a graphic plot of
equations like those above. This is done in Figure 5-1. Any set of
two-variable equations, called a bivariate relationship, can be plotted as
in Figure 5-1 by locating the point for each equation in terms of the
values on the X and Y axes. In Figure 5-1 all of the points lie on a
straight line.

Figure 5-1. Functional relationship between variables X and Y.


Now let us consider some equations that concern real-life things rather
than abstract mathematical relations. Turning back to the earlier example
of the prediction of college grade averages, imagine that a predictor test
is in use. Let us say in this case that a vocabulary test was administered
to incoming freshmen and that the scores are to be compared with grade
averages when the students reach graduation.

In order to find out how well the predictor test works, we would want
to compare vocabulary scores with grade averages. Comparisons could be
made of the raw scores on the two variables, but, because the size of raw
scores is dependent on the method of test construction used, this would
cause some confusion. The important consideration is whether the
students who make high vocabulary scores also make high grade averages
and whether the students who make low vocabulary scores also make low
grade averages. Consequently, it is most meaningful to express both
vocabulary and grade average in standard score units. To simplify the
comparison, we will deal with only nine students. Means and standard
deviations are obtained on vocabulary and grade averages for the nine
students only. Standard scores are determined for the nine students on
vocabulary and grade average, with the following results:


    Student    Vocabulary ($z_x$)    Grade average ($z_y$)

      a.             1.55                   1.18
      b.             1.16                   1.77
      c.              .77                    .59
      d.              .39                  -1.18
      e.              .00                    .59
      f.             -.39                   -.59
      g.             -.77                   -.59
      h.            -1.16                   -.59
      i.            -1.55                  -1.18


Just as in the earlier example we considered each pair of numbers as
constituting an equation, each pair of standard scores above also forms an
equation. Person a has a vocabulary standard score ($z_x$) of 1.55 and a
grade standard score ($z_y$) of 1.18. Then, by dividing 1.55 into 1.18, a ratio
of .76 is found. In the previous example, only one equation was necessary
to determine the over-all relationship between two variables. If that were
so in the present example, we could say that each person's grade standard
score is exactly .76 times as large as his vocabulary standard score. Let us
go on to see what relationships the other equations specify. The equation
for person b indicates that the ratio is 1.53, c indicates .77, and d
indicates -3.03. It is obvious that these equations provide very different
indications of the over-all relationship between vocabulary ($z_x$) and
grades ($z_y$). When, as in the example here, simultaneous equations do
not agree with one another, they are said to be nonequivalent.

Correlation. Nonequivalent relations are the typical and not the unusual
scientific finding. Whether the subject matter is physics, chemistry,
biology, or psychology, the numerous observations tell different stories about
the form of a particular relationship. In some investigations, where the
variables are simple and easily controlled, the divergence from
equivalence is very small. In more complex problems, such as predicting how
individuals will perform on a particular job, the divergence from
equivalence is usually considerable. Statistical methods were developed as a
means of dealing with nonequivalent equations.
A graphic comparison is made of vocabulary and grade standard scores
in Figure 5-2 (the “best-fit” line in the figure will be explained shortly).
It is obvious that the points do not lie on a straight line, but there is a
tendency for the two variables to go together. How would you summarize
the relationship? We might use only our impression and say that there is
a “strong correspondence” or a “moderate correspondence.” A better
approach is to use a statistical statement of the degree of relationship
between vocabulary and grade averages.


Figure 5-2. Comparison of standard scores on a vocabulary test with grade point
averages expressed as standard scores. (Horizontal axis: vocabulary test, $z_x$;
vertical axis: grade point average, $z_y$; the best-fit line is drawn through
the points.)


There are a number of ways in which statistics can be developed for
the problem here. One that might seem reasonable at first thought is to
add together all of the vocabulary standard scores and divide this sum
into the sum of the grade standard scores. This method worked with the
equations in the earlier example, but it will not do here. As you will
remember from Chapter 3, the sum of standard scores on any variable is
zero. Consequently, we would get zero equals zero as a summarizing
equation, which is certainly true, but does not provide the needed
statistic.
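
Both difficulties, the disagreement among the individual ratios and the vanishing sums, are easy to confirm. The sketch below uses the nine pairs of standard scores from the vocabulary example; the variable names are illustrative only.

```python
# Standard scores for students a through i in the vocabulary example.
z_vocab = [1.55, 1.16, 0.77, 0.39, 0.00, -0.39, -0.77, -1.16, -1.55]
z_grade = [1.18, 1.77, 0.59, -1.18, 0.59, -0.59, -0.59, -0.59, -1.18]

# Ratio of grade standard score to vocabulary standard score for each
# student; student e is skipped because his vocabulary score is zero.
ratios = [round(zy / zx, 2) for zx, zy in zip(z_vocab, z_grade) if zx != 0]

# Each set of standard scores sums to zero, so dividing one sum into
# the other gives no usable summary.
sum_vocab = sum(z_vocab)
sum_grade = sum(z_grade)

print(ratios)
print(round(sum_vocab, 6), round(sum_grade, 6))
```

The ratios range from strongly negative to well above 1.00, and both sums come out at zero apart from negligible floating-point error, which is why a different kind of summary is needed.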
Previously it was shown that working with squared scores and squared
deviations leads to some very useful results. We will appeal to that
principle again by looking for a summary equation which leads to the smallest
sum of squared errors in prediction. That is, we will seek a ratio between
vocabulary and grade standard scores, such that the sum of the squared
errors in predicting grades will be at a minimum. This means that we are
seeking an equation like the following:

$$z'_y = b z_x$$

such that

$$\sum (z_y - z'_y)^2 = \text{a minimum}$$

where $b$ is a constant
      $z_y$ is a grade standard score
      $z_x$ is a vocabulary standard score
      $z'_y$ is an estimate of a grade standard score

The equation above is referred to as a linear equation: no matter what 
value we choose for b, it results in a straight line of relationship. Of the 
infinite number of values that could be chosen for b, there is only one that 
satisfies the “least squares” criterion. The method for determining the 
correct b value is derived quite easily from elementary calculus. The 
method which is derived in this way can be applied to the present
problem. The first step is to multiply each vocabulary standard score by its
corresponding grade standard score, as follows:


     1.55 ×  1.18 = 1.83
     1.16 ×  1.77 = 2.05
      .77 ×   .59 =  .45
      .39 × -1.18 = -.46
      .00 ×   .59 =  .00
     -.39 ×  -.59 =  .23
     -.77 ×  -.59 =  .45
    -1.16 ×  -.59 =  .68
    -1.55 × -1.18 = 1.83
The final steps are to sum the standard score cross products ($\sum z_x z_y$),
and divide the result by the number of persons (N). This is all that is
necessary to obtain b, the value which satisfies the “least squares”
criterion. In this case b is .78, which means that the best (least squares)
estimate of each grade standard score is obtained as follows:

$$z'_y = .78 z_x$$

Then the estimated grade standard score for person a is

$$z'_y = .78 \times 1.55 = 1.21$$

Person a actually has a grade standard score of 1.18; consequently, the
estimate is in error by .03 score units. Although there will be errors in
prediction by using b to summarize the relationship, the sum of the
squared errors will be less than that obtained from any other linear
equation.
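
For readers who want to see where the rule comes from, the calculus step can be sketched briefly. This is the standard least-squares argument rather than a derivation quoted from the text, and it assumes, as in Chapter 3, that standard scores are formed with N in the denominator, so that the squared standard scores on any variable sum to N.

```latex
\begin{aligned}
S(b) &= \sum (z_y - b z_x)^2 \\
\frac{dS}{db} &= -2 \sum z_x (z_y - b z_x) = 0 \\
\sum z_x z_y &= b \sum z_x^2 \\
b &= \frac{\sum z_x z_y}{\sum z_x^2} = \frac{\sum z_x z_y}{N}
\end{aligned}
```

Because the second derivative, $2\sum z_x^2$, is positive, this value of b gives a minimum rather than a maximum of the sum of squared errors.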


Because the prediction equation above is a linear formula, b can be
used to draw a “best-fit” line as is done in Figure 5-2. For every unit the
line moves over the vocabulary axis, it moves up .78 units on the grade
average axis. Those who have studied trigonometry will recognize b as
the tangent of the angle formed by the best-fit line and the
vocabulary axis.

When the constant term, b, is sought by the "least squares" criterion in
respect to standard scores, it is given a special name: the correlation
coefficient. Also, when dealing with standard scores, the symbol r is
usually employed instead of b. Then we can rewrite the best-fit equation
as follows:

$$z'_y = r z_x$$

To obtain the best estimate of the grade standard scores, we multiply each
vocabulary standard score by the correlation coefficient.

When the constant term is found for standard scores, it not only serves
to place the best-fit line, it has descriptive value of its own. The correlation
coefficient, r, varies in different problems from +1.00 through zero to
-1.00. If the correlation is +1.00, it means that there is a perfect
correlation. Then the prediction equation becomes

$$z'_y = z_x$$

and the predicted standard scores will be exactly the same as the actual
grade standard scores.

As the correlation coefficient goes downward from +1.00, the
predictions become poorer and poorer. When the correlation reaches zero,
the prediction equation is as follows:

$$z'_y = 0$$

Because the mean of a set of standard scores is zero, a zero correlation
leads to the prediction that all scores fall at the mean of the assessment.
A negative correlation means that high scores on one variable go with low
scores on the other variable and vice versa.
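
The least-squares property of the correlation coefficient can be checked directly on the nine students. The sketch below is an illustration only, with names of our own choosing: it computes r as the mean cross product of the standard scores and then compares the sum of squared errors from using r as the slope with the sums from several rival slopes, including the ratios .76 and 1.53 suggested by persons a and b.

```python
# Standard scores for students a through i in the vocabulary example.
z_vocab = [1.55, 1.16, 0.77, 0.39, 0.00, -0.39, -0.77, -1.16, -1.55]
z_grade = [1.18, 1.77, 0.59, -1.18, 0.59, -0.59, -0.59, -0.59, -1.18]
n = len(z_vocab)


def sum_squared_error(slope):
    """Sum of squared errors when grades are predicted as slope * z_vocab."""
    return sum((zy - slope * zx) ** 2 for zx, zy in zip(z_vocab, z_grade))


# Correlation coefficient: mean of the standard score cross products.
# The result is close to the .78 in the text; small differences arise
# because the tabled standard scores are rounded to two places.
r = sum(zx * zy for zx, zy in zip(z_vocab, z_grade)) / n

error_at_r = sum_squared_error(r)
rival_slopes = [0.0, 0.5, 0.76, 1.0, 1.53]
rival_errors = {s: sum_squared_error(s) for s in rival_slopes}

print("r =", round(r, 3), " squared error =", round(error_at_r, 3))
for slope, err in rival_errors.items():
    print("slope", slope, " squared error =", round(err, 3))
```

Note that a slope of zero, which amounts to betting the mean for every student, does worse than r here; the predictor test has improved the bets.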

Although the formula for the correlation coefficient, r, above is
relatively easy to understand, it is not the easiest approach to computing the
correlation. If scores are standardized over several hundred persons,
rather than over the nine cases employed above, it results in considerable
work. Therefore, it is easier to compute the correlation coefficient from
deviation scores or from raw scores. Some manipulations of the basic
formula will show how this is done.

$$r = \frac{\sum z_x z_y}{N} \tag{5-1}$$

$$r = \frac{\sum \left(\dfrac{x}{\sigma_x}\right)\left(\dfrac{y}{\sigma_y}\right)}{N}$$

$$r = \frac{\sum xy}{N \sigma_x \sigma_y} \tag{5-2}$$

Formula (5-2) can be used to compute the correlation coefficient from
deviation scores. Next we can replace each x value in Eq. (5-2) by
$X - M_x$ and each y value by $Y - M_y$, and obtain the following raw score
formula:

$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} \tag{5-3}$$

It must be remembered that the correlation coefficient specifies the
relationship between two sets of standard scores. Either we can
begin with standard scores, as in Eq. (5-1), or we can standardize
in the computations, as in Eqs. (5-2) and (5-3). The three formulas
above will give exactly the same correlation coefficient. Although Eq.
(5-3) looks complicated, it is actually the easiest way to obtain the
correlation coefficient if an automatic calculator is available.
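
That the three formulas agree can be verified on a small set of data. The raw scores below are hypothetical: they were chosen for this sketch so that their standard scores reproduce the nine-student example, and they do not appear in the text. Standard deviations are computed with N in the denominator.

```python
import math

# Hypothetical raw scores for students a through i (illustration only).
X = [9, 8, 7, 6, 5, 4, 3, 2, 1]   # vocabulary
Y = [7, 8, 6, 3, 6, 4, 4, 4, 3]   # grade average on an arbitrary scale
n = len(X)

mean_x = sum(X) / n
mean_y = sum(Y) / n

# Deviation scores and standard deviations (N in the denominator).
x = [v - mean_x for v in X]
y = [v - mean_y for v in Y]
sd_x = math.sqrt(sum(v * v for v in x) / n)
sd_y = math.sqrt(sum(v * v for v in y) / n)

# Eq. (5-1): mean cross product of standard scores.
z_x = [v / sd_x for v in x]
z_y = [v / sd_y for v in y]
r_standard = sum(a * b for a, b in zip(z_x, z_y)) / n

# Eq. (5-2): deviation-score formula.
r_deviation = sum(a * b for a, b in zip(x, y)) / (n * sd_x * sd_y)

# Eq. (5-3): raw-score formula.
numerator = n * sum(a * b for a, b in zip(X, Y)) - sum(X) * sum(Y)
denominator = (math.sqrt(n * sum(a * a for a in X) - sum(X) ** 2)
               * math.sqrt(n * sum(b * b for b in Y) - sum(Y) ** 2))
r_raw = numerator / denominator

print(round(r_standard, 4), round(r_deviation, 4), round(r_raw, 4))
```

The three values coincide apart from floating-point error. The raw-score formula needs only running sums of X, Y, XY, and the squares, which is what makes it convenient for machine computation.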
It is not only easier to compute the correlation coefficient from
deviation scores and raw scores, it is often convenient to form the prediction
equation from deviation scores and raw scores.
The standard-score formula

$$z'_y = r z_x \tag{5-4}$$

can be rewritten in deviation-score form:

$$y' = r \frac{\sigma_y}{\sigma_x} x \tag{5-5}$$

where $y'$ is the estimated deviation score on the assessment (grades)
      $x$ is the actual deviation score on the predictor (vocabulary)
      $r$ is the correlation between predictor and assessment
      $\sigma_y$ is the standard deviation of the assessment
      $\sigma_x$ is the standard deviation of the predictor
Formula (5-5) could be used to predict students’ deviation scores in
college grades from their deviation scores on the vocabulary test. Thus, if a
student has a vocabulary deviation score of 3 and the correlation is .50,
and the standard deviations for vocabulary and grades are 2 and 1
respectively, the prediction is as follows:

$$y' = .50 \times \frac{1}{2} \times 3 = .25 \times 3 = .75$$

The prediction is that he will score .75 deviation score units above the
mean of grade averages. Let us say that he actually makes a deviation
score of 1.25 in grades, and therefore, the prediction is in error by .50
deviation score units.


The prediction equation can also be expressed in raw score terms:

$$Y' = r \frac{\sigma_y}{\sigma_x}(X - M_x) + M_y \tag{5-6}$$

where $Y'$ is an estimated raw score on the assessment (school grades)
      $X$ is an actual raw score on the predictor (vocabulary)
      $\sigma_y$ is the standard deviation of the assessment
      $\sigma_x$ is the standard deviation of the predictor
      $M_y$ is the mean of the raw assessment scores
      $M_x$ is the mean of the raw predictor scores

Although Eq. (5-6) looks complicated, it is a straightforward extension of
Eqs. (5-4) and (5-5). Formula (5-6) might predict that a person with
a raw vocabulary score of 20 will make a raw grade average of 4.0.
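
The deviation-score and raw-score prediction equations translate directly into two small functions. In the sketch below the function names are our own, and the means used in the raw-score call (17 for vocabulary, 3.25 for grades) are hypothetical values chosen only so that a vocabulary score of 20 leads to a predicted grade average of 4.0; the text does not give them.

```python
def predict_deviation(x_dev, r, sd_y, sd_x):
    """Deviation-score prediction: y' = r * (sd_y / sd_x) * x."""
    return r * (sd_y / sd_x) * x_dev


def predict_raw(x_raw, r, sd_y, sd_x, mean_y, mean_x):
    """Raw-score prediction: Y' = r * (sd_y / sd_x) * (X - M_x) + M_y."""
    return r * (sd_y / sd_x) * (x_raw - mean_x) + mean_y


# Worked example from the text: vocabulary deviation score of 3, r = .50,
# standard deviations of 2 (vocabulary) and 1 (grades).
y_dev = predict_deviation(3, 0.50, sd_y=1, sd_x=2)

# Raw-score version with hypothetical means (assumed for illustration).
y_raw = predict_raw(20, 0.50, sd_y=1, sd_x=2, mean_y=3.25, mean_x=17)

print(y_dev, y_raw)
```

The raw-score function simply converts the raw predictor score to a deviation score, applies the deviation-score prediction, and adds back the mean of the assessment.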
Both Formulas (5-5) and (5-6) can be used to plot best-fit lines.
Formula (5-5) would be used if deviation scores are plotted, and For- 
mula (5-6) would be used if raw scores are plotted. The correlation 
coefficient and the three prediction equations above are the basic statis- 
tical procedures needed for the validation and use of predictor tests. 
The correlation coefficient is, of course, not used to predict grade aver- 
ages that are already known, but to forecast the grade averages of incom- 
ing freshmen. If subsequent groups of incoming freshmen are students 
of about the same caliber as their predecessors, the correlation between 
vocabulary and later grades will be approximately the same. Therefore, 
it is safe to use the first correlation to forecast the grade averages
of incoming freshmen.

The Standard Error of Estimate. Another way to understand correla- 
tional analysis is to think in terms of the errors of prediction. Figure 5-3 
shows a typical comparison of raw scores on a test and on an assessment. 
(Whereas in the earlier example the correlation was .78, here the
correlation is .50.) Here we will assume that the best-fit line is known
to begin with, as shown in Figure 5-3, and see how it is used to make
predictions.

Figure 5-3. Scatter diagram of vocabulary scores and school grade averages.

When a bivariate relationship is pictured like that in Figure 5-3, it is
referred to as a scatter diagram. The points are plotted in terms of
intervals instead of discrete scores. That is, all of the vocabulary scores
through 9 are plotted in the first column, all of the scores from 10
through 19 are plotted in the second column, and so on for all intervals.
Each column is referred to as an “array,” and there are as many arrays as
there are intervals on the vocabulary axis.

In order to predict any student's grade, simply find his score on the
vocabulary axis and then move upward to the best-fit line. Then move
parallel to the vocabulary axis to the predicted grade average. If a student
makes a vocabulary score of 10, the prediction is that he will make a
grade average of approximately 2.92. If a student makes a vocabulary
score of 80, the prediction is that he will make a grade average of
approximately 4.12. In each case we predict as though all of the points lay
on the best-fit line, but, in fact, the points in each array scatter above
and below the line. This tells us that a prediction from any point on the
vocabulary axis about how well an individual will perform in school is
likely to be in error in proportion to the scatter of the Y points about
the prediction line. If the scatter about the best-fit line is large, then
prediction will be poor; as the scatter lessens, the prediction becomes
better and better.

In Chapter 3, the standard deviation was given as a measure of disper- 
sion. The same kind of measure can be used in the scatter diagram to 
describe the error in predicting grades from vocabulary scores. A measure
of the standard deviation type could be determined for any one array by
subtracting the predicted grade average (at the point of the best-fit line)
from each of the grade averages actually found. We could then find the
standard deviation of the prediction errors. A like measure could be
obtained for each of the arrays, and these could be averaged to determine
an over-all estimate of the dispersion about the best-fit line. The method
used in practice is slightly different from this procedure. Instead of
finding the separate dispersions and averaging them, all of the deviates
are determined about their respective points of prediction, and these are
placed in the conventional standard deviation formula. The resulting value
is called the standard error of estimate and will be symbolized as
$\sigma_{est}$.
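
The procedure just described can be sketched as follows. This is only a
minimal illustration: the function name and the four pairs of actual and
predicted grade averages are hypothetical, and the sketch assumes that the
"conventional standard deviation formula" divides by the number of persons.

```python
import math


def standard_error_of_estimate(actual, predicted):
    """Root mean squared deviation of actual scores about their predictions.

    Each deviate is taken about its own point of prediction, and all of the
    deviates are pooled into a single standard-deviation-type measure.
    """
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical grade averages and the predictions read from a best-fit line.
actual = [2.0, 3.0, 3.5, 4.0]
predicted = [2.5, 3.0, 3.0, 4.0]

print(round(standard_error_of_estimate(actual, predicted), 3))
```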

The standard error of estimate is an absolute measure of error in
predicting Y, but the units in which the assessment is expressed are often
arbitrary functions of test construction practices. If, for example, the
assessment is a course examination, the grades could range from 1 to 10,
50 to 100, or assume any other range in accordance with the system of
grading used. Consequently, it is necessary to compare the standard error
of estimate with the standard deviation of the assessment before the
efficiency of prediction can be determined.

As was mentioned previously, if no knowledge were had about the
assessment scores of particular people, the best bet would be that they all
fell at the mean of Y. These bets would be in error in accordance with
the size of the standard deviation of assessment scores which is actually
obtained. The standard deviation of Y would specify the likelihood with
which predicted scores would fall within certain intervals on the
assessment continuum. Then, as would be expected, when the correlation
between the test and the assessment is zero, the standard error of estimate
will be the same as the standard deviation of Y. As the correlation
becomes larger and larger, $\sigma_{est}$ becomes comparatively smaller than
$\sigma_y$. The ratio of these two measures of dispersion leads to a very
useful index of predictive efficiency.

It is easy to picture what the scatter diagram will look like as the
correlation becomes larger and larger; the area of scatter will become
progressively more narrow. This progressive narrowing can be seen in
Figure 5-4.

Figure 5-4. Areas of scatter for different size correlations. (Vertical
axis: assessment score, Y; horizontal axis: predictor test score, X.)

The correlation coefficient concerns the squares of $\sigma_{est}$ and
$\sigma_y$. The squares are used because the resulting formula indicates how
the best-fit line should be drawn, as well as indicating the degree of
relationship between the two variables. As was previously noted, the square
of any standard deviation type value is called the variance. The ratio of
the two squared values is then a variance ratio, a term that will be seen
quite often in testing theory. If this ratio were used as a measure of
relationship, it would read “backwards”: a high ratio would indicate a
weak relationship, and a low ratio would represent a strong relationship.
This difficulty is overcome by subtracting the ratio from 1, resulting in
the following formulas:

$$r^2 = 1 - \frac{\sigma_{est}^2}{\sigma_y^2} \qquad \text{(5-7)}$$

$$r = \sqrt{1 - \frac{\sigma_{est}^2}{\sigma_y^2}} \qquad \text{(5-8)}$$


The letter r stands for the correlation coefficient. $r^2$ is said to be
the “per cent of variance explained,” and $1 - r^2$ is spoken of as the
“per cent of variance unexplained.” It is perhaps easier to visualize the
meaning of $r^2$ than r itself, but as was shown previously, the
correlation coefficient is directly useful in making predictions.

Although $\sigma_{est}$ does not appear directly in Formulas (5-1), (5-2), or
(5-3), it can easily be derived after the r is obtained. Formula (5-7) can
be suitably transformed as follows:

$$\sigma_{est} = \sigma_y \sqrt{1 - r^2} \qquad \text{(5-9)}$$
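
The relation between the standard error of estimate, the standard deviation
of the assessment, and r can be checked numerically. The sketch below
assumes the usual form of the relationship, in which the squared
correlation equals 1 minus the ratio of the two variances; the function
names and the standard deviation of 10 are hypothetical.

```python
import math


def r_squared(sd_est, sd_y):
    """Proportion of variance explained: 1 minus the variance ratio."""
    return 1.0 - (sd_est ** 2) / (sd_y ** 2)


def sd_est_from_r(r, sd_y):
    """Standard error of estimate recovered from r and the assessment SD."""
    return sd_y * math.sqrt(1.0 - r ** 2)


sd_y = 10.0  # hypothetical standard deviation of the assessment
for r in (0.0, 0.5, 0.9):
    sd_est = sd_est_from_r(r, sd_y)
    # Going back through the variance ratio should return r squared.
    print(r, round(sd_est, 2), round(r_squared(sd_est, sd_y), 2))
```

With a correlation of zero the standard error of estimate equals the
standard deviation of the assessment, and it shrinks only slowly as the
correlation grows.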


Nonlinear Relationships. We have spoken so far only about linear rela- 
tionships. Formulas (5-1), (5-2), and (5-3) indicate the linear correla- 
tion between two sets of scores, and Formulas (5-4), (5-5), and (5-6) 
place the best-fitting straight lines in scatter diagrams. Formulas (5-7),
(5-8), and (5-9) were used to illustrate the dispersion about the
best-fitting straight line. A possibility that has not been mentioned so
far is
that the bivariate relationship might not be linear. A relationship is linear 
if the points in each array spread equally above and below the best-fit 
straight line. Technically, the relationship is linear if the mean assessment 
score for each array falls exactly on the best-fit line. 

In any particular scatter diagram, the array means can be plotted and
compared with the best-fit straight line. It is expected that the array
means will diverge from the line by small amounts if only because of
sampling error. However, in the use of psychological tests it is rare to
find that the array means show some systematic trend other than a
straight-line function. When this happens, the relationship is called
curvilinear, an example of which is shown in Figure 5-5. Figure 5-5 shows
that assessment scores increase with predictor test scores up to a point and
then start to decline. Then the people who make very high scores on the
test do less well, as a group, on the assessment than the people who make
only moderately high test scores. Although curvilinear relationships are
sometimes found in psychological experiments, such a marked departure
from linearity as the one shown in Figure 5-5 is rarely found in using
predictor tests.

Figure 5-5. Scatter diagram illustrating a curvilinear relationship.
(Vertical axis: assessment score, Y; horizontal axis: predictor test
score, X.)

Formula (5-7) was illustrated with a linear relationship, but it can be 
applied equally well to curvilinear relationships like that in Figure 5-5. 
The only difference is in the way that the standard error of estimate is 
obtained. Instead of computing the statistic about the best-fitting straight 
line, $\sigma_{est}$ can be obtained about the array means. The mean
assessment score in each array is subtracted from the assessment scores in
each array. The differences are then squared and summed over all arrays.
The resulting sum is divided by the number of individuals. This is then the
squared standard error of estimate taken about the array means. When
this is placed in Formula (5-8), it gives a general index of correlation
which is independent of the form of the relationship. Formula (5-8) is a
measure of correlation that can be used on relationships of all kinds.
When the standard error of estimate is obtained about the best-fitting
straight line, the resulting statistic is called the correlation coefficient
and symbolized as r. When the standard error of estimate is obtained about
the array means, the resulting statistic is called the correlation ratio,
or eta.
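
The steps described above for the correlation ratio can be sketched as
follows. This is a minimal illustration under the assumption that the
general index is the square root of 1 minus the ratio of the variance about
the array means to the total variance of the assessment; the function name
and the six scores are hypothetical.

```python
import math
from collections import defaultdict


def correlation_ratio(arrays, scores):
    """Eta: correlation index with errors taken about the array means."""
    groups = defaultdict(list)
    for array_label, y in zip(arrays, scores):
        groups[array_label].append(y)

    n = len(scores)
    grand_mean = sum(scores) / n
    total_ss = sum((y - grand_mean) ** 2 for y in scores)

    # Squared deviations of each score from the mean of its own array,
    # summed over all arrays.
    within_ss = 0.0
    for ys in groups.values():
        array_mean = sum(ys) / len(ys)
        within_ss += sum((y - array_mean) ** 2 for y in ys)

    # Both sums would be divided by n, so n cancels in the ratio.
    return math.sqrt(1.0 - within_ss / total_ss)


# Hypothetical data: three arrays whose means rise and then fall.
arrays = [0, 0, 1, 1, 2, 2]
scores = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]
print(round(correlation_ratio(arrays, scores), 3))
```

Because the array means here rise and then fall, a straight line would
describe these hypothetical scores poorly, while eta still registers the
relationship.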
It must be kept in mind that the linear correlation formulas will place
the best-fitting straight line in the scatter diagram regardless of whether
the relationship is actually linear. Before applying the linear correlation
formulas, the scatter diagram should be inspected to see if the
relationship is reasonably linear. If there is only a moderate curve in the
relationship, the linear formula will still do an adequate job of describing
the trend. However, if there is apparently a marked tendency toward
curvilinearity, a test of statistical significance should be made (see 3,
pp. 255-262). Then if the departure from linearity is statistically
significant, either the correlation ratio should be applied or people
should be told that the linear correlation formula was applied to a
curvilinear relationship. When correlations are stated in research reports,
readers will assume that the relationships are reasonably linear. Among the
other assumptions which are made in employing the correlation coefficient,
r, linearity of relationship is the most important.

If a linear relationship is found between predictor and assessment, this 
provides at least a practical justification of the original assumption of an 
interval scale for the predictor test. A linear relationship means that 
equal intervals on the predictor correspond to equal intervals on the 
assessment. Because a linear relationship is the usual and not the
unexpected finding in working with predictor tests, this shows that for
practical purposes the assumption of an interval scale and the use of the
statistics which depend on this assumption are often justified.

In order to use $\sigma_{est}$ as a measure of the error in making
predictions, the bivariate relationship should have several characteristics
in addition to linearity. Earlier in the chapter there was a discussion of
the errors of prediction in terms of the array dispersions. A possibility
which was not mentioned at the time is that the array dispersions are not
equal. An example is shown in Figure 5-6. Although relationships similar to
the one in Figure 5-6 are sometimes found in practice, it is usually
considered an undesirable condition. The $\sigma_{est}$ obtained for this
relationship would be an underestimate of the error in predicting from high
X scores, and an overestimate of the error in predicting from low X scores.
The assumption in using the correlation coefficient is that the dispersion
about the best-fit line is the same for all of the arrays on the X axis.
The presence of equal array dispersions is referred to by the rather
formidable name of homoscedasticity.

Figure 5-6. Scatter diagram in which array dispersion increases.

Not only is it desirable to have equal array dispersions, it is helpful if
the points in each array are normally distributed about the best-fit line.
This means that the points are more numerous where the best-fit line
crosses the particular array and progressively less numerous in moving
upward or downward from the best-fit line, in accordance with the
frequencies expected from the normal distribution. Normality of array
dispersions is necessary if the $\sigma_{est}$ is used as a standard
deviation type measure. Only if the distribution is normal will 68 per cent
of the cases fall between plus and minus 1 standard error, and so on for
all other segments of the distribution.

If array dispersions are normal, the standard error of estimate can be
used to determine the probability with which predicted assessment scores
will deviate from actual assessment scores by particular amounts. If, for
example, a predicted score is 40 and the $\sigma_{est}$ is 5, the odds are
only 5 in 100 that the actual assessment score is greater than 50 or less
than 30 (remember from Chapter 3 that only 5 per cent of the cases are
beyond plus 2 and minus 2 standard deviations of the normal distribution).
If the array dispersions are not normal, $\sigma_{est}$ will lead to
inaccurate probability statements. It is fortunately the case that array
dispersions are almost always normal.
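
The odds quoted above can be checked with the normal curve. The following
is a minimal sketch that assumes normal array dispersions, as the text
does; the function name is hypothetical, and the predicted score of 40 and
standard error of 5 are those of the example.

```python
import math


def prob_error_exceeds(k):
    """Two-tailed normal probability of missing by more than k standard errors."""
    return 1.0 - math.erf(k / math.sqrt(2.0))


predicted, sd_est = 40.0, 5.0
for k in (1.0, 2.0):
    low, high = predicted - k * sd_est, predicted + k * sd_est
    # Probability that the actual score falls outside the band low..high.
    print(low, high, round(prob_error_exceeds(k), 3))
```

The exact two-tailed value at 2 standard errors is a little under 5 in
100; the round figure of 5 per cent corresponds more precisely to 1.96
standard errors.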

Another feature which is desirable in the correlation problem, but not 
absolutely necessary, is that each of the two variables has normal dis- 
tributions of scores. In practice it is found that highly skewed variables 
and other forms that differ markedly from the normal are often accom- 
panied by one or more of the several undesired features in the bivariate 
relationship, such as nonlinearity. 

Regression. Looking back at Eq. (5-4), it can be seen that the best
prediction is always that an individual will make a standard score nearer
the mean of the assessment than his standard score on the predictor,
except in the case of a correlation of 1.00. Also, because each of the
standard scores on the predictor is multiplied by a constant, r, the
standard deviation of the predicted scores will always be smaller than the
standard deviation of the real assessment standard scores. Because of the
need to make bets nearer the mean of Y than the predictor standard score
would indicate, the best-fit line is often referred to as the regression
line and Eqs. (5-4), (5-5), and (5-6) as regression equations.
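
The shrinkage of predicted standard scores can be seen in a small
numerical sketch. It assumes only what is stated above, namely that each
predictor standard score is multiplied by r; the function names, the five
predictor scores, and the correlation of .50 are hypothetical.

```python
import math


def population_sd(values):
    """Standard deviation with the number of cases in the denominator."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def standard_scores(values):
    """Convert raw scores to standard scores (mean 0, SD 1)."""
    mean = sum(values) / len(values)
    sd = population_sd(values)
    return [(v - mean) / sd for v in values]


r = 0.50                               # hypothetical correlation
predictor_raw = [10, 20, 30, 40, 50]   # hypothetical predictor scores
z_x = standard_scores(predictor_raw)
z_predicted = [r * z for z in z_x]     # each predictor standard score times r

for zx, zp in zip(z_x, z_predicted):
    print(round(zx, 2), round(zp, 2))

# The predicted standard scores have a standard deviation equal to r.
print(round(population_sd(z_predicted), 2))
```

Every prediction lies nearer to zero, the mean of the standard scores, than
the predictor score from which it was made.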

The observed tendency for predictions to regress toward the mean
was the original stimulus for developing the correlation coefficient. For
example, it was noted that the sons of very tall fathers tended to be
shorter than their fathers and that sons of very short fathers tended to
be taller than their fathers. There is a simple reason why the regression
tendency occurs. The people who make the highest scores on the predictor
test can either make the highest assessment scores or they can vary
downward toward the mean. Consequently, the array means of the
extremes on the predictor regress toward the assessment mean. The
in-between arrays tend to follow suit, resulting in the general tendency of
regression toward the mean.

The slant, or slope, of the best-fit line is evidence of the regression
tendency. The slope indicates the number of units that the line rises on Y
for each unit advanced on X. When dealing with standard scores, the
correlation coefficient is the slope of the best-fit line. For example,
when there is a correlation of .50, the best-fit line moves over two units
on the predictor axis before going up one unit on the assessment axis.

Correlational analysis can be used to predict the X's from the Y's as well
as the Y's from the X's. It is arbitrary which variable is labeled X and is
placed on the horizontal axis rather than labeled Y and placed on the
vertical axis. In the example above, the height of fathers could be
predicted from the height of sons. The correlation coefficient will be the
same regardless of which variable is labeled X. Also, we could predict
test scores from grade point averages, although there would be no
practical purpose served. Correlation Formulas (5-1), (5-2), and (5-3)
would be unchanged, and we would obtain the same coefficient of .78
that was determined earlier. Regression, correlation, and standard error
of estimate Formulas (5-4) through (5-9) would be altered by
substituting Y's for X's and Y subscripts for X subscripts (and vice
versa). In most testing problems it is important to estimate the assessment
from the predictor test, but not vice versa. Consequently, the assessment
will be labeled Y and shown on the vertical axis throughout the text, and
the regression problem will be treated in terms of the prediction of
assessment scores.

Linear Transformations. One important property of the correlation
coefficient is that it is completely insensitive to linear transformations
of the predictor and assessment scores. A linear transformation consists of
adding a constant amount to each score, subtracting a constant amount,
or multiplying or dividing each of the scores by a positive constant
amount. For example, the correlation in Figure 5-3 would remain .50 if the
number 10 million were added to each of the scores on the predictor test
and each of the assessment scores were multiplied by 20 billion. The reason
why the correlation coefficient is insensitive to linear transformations is
that it is a measure of the relationship between two sets of standard
scores. Either standard scores are used in the correlation formula or the
conversion to standard scores is done in the deviation score and raw score
computational formulas.

THE STATISTICAL SIGNIFICANCE OF CORRELATIONS 


Before correlation coefficients are used in any way, they should be
tested for statistical significance. In undertaking a correlational
analysis, the interest is usually in determining the relationship between
two variables, here predictor test and assessment scores, in a whole
population, or universe, of people. We are seldom interested in only the
relatively small sample of persons whose scores are actually used to
compute the correlation. Therefore, we must question the precision with
which a correlation coefficient determined from a sample of persons mirrors
the real correlation between variables in the whole universe of persons.

To illustrate the influence of sampling error on correlation coefficients, 
imagine that we are trying to learn the correlation between height and 
weight for British male adults. It would be exceedingly laborious to 
measure the height and weight of every male above twenty-one years in 
England; and it would be no small amount of work to compute the 
correlation. If a research worker set out to learn the correlation between 
height and weight, he would probably work with only a sample from 
the universe. The sample might be quite large, say over ten thousand 
men, or perhaps only a few dozen men would be used instead. 


Figure 5-7. Distributions of sample correlations when the universe correlation is .00.
(N is the sample size.)

Even if the universe correlation is zero, the correlation found for a
particular sample probably would not be exactly zero. We might draw a
number of different samples of persons, say all of them with 100 persons,
and run the separate correlations. It would be expected that these
correlations would range around zero. There would be no reason to expect
more positive than negative coefficients or vice versa. The dispersion of
the coefficients about zero would be dependent on the N or sample size. That
is, the correlations obtained from successive samples would crowd nearer
to zero when the sample size is large, assuming again that the universe
correlation is exactly zero. The distributions of coefficients that would be
obtained for different sample sizes would look like those in Figure 5-7.
Because the true correlation is zero in Figure 5-7, all of the non-zero
coefficients are due to chance. On occasion these chance coefficients can
be quite large. Notice that with an N of 10, correlations greater than
+.50 and less than -.50 would occur appreciably often purely by chance.
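
The behavior of chance correlations can be imitated by drawing many
samples from two unrelated variables. The following sketch is a simulation
under stated assumptions rather than a reproduction of Figure 5-7: both
variables are drawn independently from a normal distribution, the number
of samples (1,000) and the random seed are arbitrary, and all function
names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def chance_correlations(n, trials, rng):
    """Correlations from repeated samples of size n when the true value is zero."""
    results = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]  # drawn independently of xs
        results.append(pearson_r(xs, ys))
    return results


def spread(values):
    """Standard deviation of a list of coefficients."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


rng = random.Random(0)
for n in (10, 100):
    rs = chance_correlations(n, 1000, rng)
    beyond_half = sum(1 for r in rs if abs(r) > 0.50) / len(rs)
    print(n, round(spread(rs), 2), round(beyond_half, 3))
```

The printed spread should be noticeably larger for samples of 10 than for
samples of 100, and coefficients beyond .50 in either direction should be
fairly common only in the smaller samples.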
Before any effort is made to interpret a particular correlation, some
assurance must be obtained that the coefficient is significantly different
from zero. Otherwise the correlation may represent a chance sampling
of people from a universe in which the true value is zero. One way to
gain some confidence in particular correlations is to learn the odds, or
the probability, with which the correlation could have been obtained by
chance alone. If the probability is very low, say 1 in 100, or at the .01
level as it will be called, then it is relatively safe to assume that the
correlation is not merely a chance relationship.

The Standard Error of the Correlation. In Chapter 3 it was shown how
the normal curve can be used to determine probabilities. For example,
it was said that only 32 per cent of the cases would be beyond plus and
minus 1.00 standard deviations, and that only 5 per cent of the cases
would lie beyond plus and minus 2.00 standard deviations. If it were
known that chance distributions of correlations such as those shown in
Figure 5-7 are normally distributed, it would be possible to determine
the probability of getting correlations of certain sizes by chance. All that
is required is to obtain the standard deviation of the distribution of
coefficients. This can be obtained approximately as follows:


$$\sigma_r = \frac{1}{\sqrt{N - 1}} \qquad \text{(5-10)}$$

The statistic $\sigma_r$ is called the standard error of the correlation
coefficient. Its use can be illustrated by returning to the correlation
between the height and weight of men in England. If our sample consisted of
101 persons, the standard error of the correlation would be:

$$\sigma_r = \frac{1}{\sqrt{101 - 1}} = .10$$

This means that, if the universe correlation is zero, the standard deviation
of correlations obtained with samples of 101 persons will be .10. Then
we would expect 68 per cent of the correlations obtained by chance not
to exceed plus or minus .10; in other words, only 32 per cent would
exceed these points. Thus, if a correlation is obtained for a sample of size
101, the $\sigma_r$ allows us to determine the odds that the sample was drawn
from a population which has a correlation of zero.

It is customary not to accept a finding as statistically significant unless
it is at the .05 level or beyond. Looking at the per cents of cases in
various areas of the normal distribution in Appendix 3, it is found that
5 per cent of the cases lie beyond plus and minus 1.96 standard deviations
from the mean. By multiplying 1.96 times the $\sigma_r$ for a particular
sample size, the .05 level for correlation coefficients can be determined.
When the N is 101 and, consequently, the $\sigma_r$ is .10, the .05 level
would be represented by correlations greater than +.196 and less than
-.196. Because it is customary to use only two decimal places for
correlations, the .05 level would be set at correlations of +.20 and -.20.
Then if with a sample size of 101 persons a correlation as large as +.20
or -.20 is found, it can be said that the odds are less than 5 in 100 that
the sample was drawn from a population in which the correlation was zero.
This is another way of saying that the particular correlation is
significant at the .05 level. In a similar way it can be found that
correlations of .26 or above and correlations of -.26 or below are
significant at the .01 level for a sample size of 101 (see Appendix 3). The
standard error of the correlation can be found for any sample size by
substituting the N in Formula (5-10).
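
The arithmetic of Formula (5-10) and of the .05 and .01 levels can be
sketched as follows. The function names are hypothetical, and the
multiplier of 2.58 for the .01 level is the customary normal-curve value,
assumed here because the text gives only the resulting cutoff of .26.

```python
import math


def standard_error_of_r(n):
    """Formula (5-10): approximate standard error of r when the true value is zero."""
    return 1.0 / math.sqrt(n - 1)


def critical_r(n, z):
    """Smallest absolute correlation that lies z standard errors from zero."""
    return z * standard_error_of_r(n)


n = 101
print(round(standard_error_of_r(n), 2))   # standard error for 101 persons
print(round(critical_r(n, 1.96), 2))      # cutoff for the .05 level
print(round(critical_r(n, 2.58), 2))      # cutoff for the .01 level

# The same functions show how the cutoff changes with sample size.
for size in (26, 101, 401):
    print(size, round(critical_r(size, 1.96), 2))
```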

Because of an untenable assumption, the standard error of the corre- 
lation has only limited use. The formula assumes that when samples are 
drawn from a universe in which the correlation is zero the sample 
correlations will be normally distributed. This is not quite true for any 
size of sample, and it is a particularly bad assumption for samples with 
less than thirty cases. The $\sigma_r$ offers a good approximation to the
needed error term under two conditions: (1) when the problem is to
determine the significance of the difference of a particular correlation
from zero correlation, and (2) when the sample size is reasonably large,
say with an N of 100 or more.

Although $\sigma_r$ provides a good estimate of the sampling error of
correlation coefficients under the conditions stated above, there are more
precise error formulas that can be used (see 1, 2, 3, 6). The standard error
of the correlation coefficient was described in order to illustrate tests of
significance. Rather than go more deeply into the statistical checks which
can be applied, we will discuss the implications of significance tests for
measurement programs.

Before any practical decisions are made about the usefulness of a 
predictor test, it must first be determined whether or not the test is 
significantly related to the criterion variable. Unless the correlation is 
based on a very large number of cases, say over 300, there is
considerable sampling error. Any decisions about using a test should be
tempered by the size of the sampling error as well as by the size of the
obtained correlation. Samples of less than a hundred persons have so
much sampling error that they do not offer firm enough evidence for
the acceptance of tests. It is only as the sample size gets to be larger than
several hundred persons that we can have confidence in the indicated
validity.

If the problem in a testing program is to decide whether or not to use 
a particular test, it must first be determined if the test-assessment correla- 
tion is significantly different from zero. If the correlation is not signifi- 
cantly different from zero, the predictor test should be held suspect. It 
is always possible to gather a larger sample and examine the significance 
of the new correlation. The burden should always be placed on the test
to prove its utility.

In some testing situations there is a minimum validity that a test must
have in order to be usable. For example, it might be the case that the
expense of using a test can be justified only if the test-assessment
correlation proves to be greater than .30 in continued use. If on the first
sample of test responses a correlation of .55 is found, a statistical check
can be made to see if the correlation is significantly greater than .30.

Another problem that will occur quite often in testing is that of
deciding whether one test works better than another in predicting a
particular assessment. For example, a vocabulary test and an arithmetic
test could be given to fourth-grade students. It might be found later that
the vocabulary test and the arithmetic test correlate .60 and .40
respectively with school grades. If the intention is to use only one test to
assign children to class sections, it would have to be determined whether
the apparent difference in predictive power could likely be due to chance.


THE INFLUENCE OF THE DISPERSION ON CORRELATIONS

An important property of the correlation coefficient is that its size is
directly related to the standard deviation of the assessment. Looking back
at Figure 5-3, what would happen to the correlation if only the people
with grade averages above 2.67 were considered in the calculation? This
would certainly reduce the standard deviation of grades, Y, but it would
have relatively little influence on $\sigma_{est}$. Remember that the
assumption of equal array dispersions requires that $\sigma_{est}$ will be the
same regardless of the region of predictor scores being considered. If only
the people who score above 2.67 in grades were considered in the
computations, the correlation would be smaller. Look back at Formula (5-7)
and you will see that this is the only possible result. Whenever the
dispersion of ability on the assessment is altered, the correlation changes.
Be careful to note that the change in dispersion must be a change in
sampling such that the range of “real ability” varies. As you remember,
changing the size of the standard deviation by a linear transformation of
the assessment scores will have no influence on the correlation. Whereas
$\sigma_{est}$ tends to be the same in different samples of persons, the
correlation varies with the assessment dispersion.
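
The argument can be followed numerically by holding the standard error of
estimate fixed and shrinking the standard deviation of the assessment. The
sketch below assumes the relation referred to in the text as Formula (5-7),
that the squared correlation is 1 minus the ratio of the two variances; the
function name and all of the values are hypothetical.

```python
import math


def r_from_dispersions(sd_est, sd_y):
    """Correlation implied by the standard error of estimate and the assessment SD."""
    return math.sqrt(1.0 - (sd_est ** 2) / (sd_y ** 2))


sd_est = 0.60  # hypothetical standard error of estimate, held constant
for sd_y in (1.00, 0.80, 0.65):
    # As the assessment dispersion shrinks toward sd_est, r falls toward zero.
    print(sd_y, round(r_from_dispersions(sd_est, sd_y), 2))
```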

Because of the influence of the assessment dispersion on validity, the 
standard deviation of the assessment scores should be compared with that 
which is generally found in the assessment situation. If the dispersion used 
in validation work is different from the usual dispersion found, then a 
statistical correction gives an estimate of what the test validity would 
be with a more representative dispersion of assessment scores (see 5, 
pp. 169-176). 

The influence of the dispersion on test validity relates to an interesting 
and often misunderstood observation: intelligence tests become less 
successful predictors as higher educational levels are reached. The tests 
correlate around .70 with grammar school grades, around .60 with high 
school grades, around .50 with college grades, and they tend to correlate 
only slightly with the grades of students in graduate school. The reason 
for the progressive decline in validity is that the dispersion of intellectual 
ability is gradually being decreased. The less able students are dropped 
out year by year. By the time graduate school is reached, there is rela- 
tively little dispersion of intellectual ability. Tests can “commit suicide,” 
if, as is often the case, the tests are used at various stages to determine 
who should continue in school. Theoretically the stage could be reached 
where the test correlates zero with the assessment which it has been so 
successful in predicting over the years. This is much like the physician 
who advises his patients so successfully on how to remain well that he 
finds himself with no ailments to treat. 


THE SELECTION OF PERSONNEL 


The statistical tools which have been discussed in this chapter are all
aimed at answering the question, “How good are particular tests?” In
many situations, an ability test which correlates .50 with its criterion will
be considered as showing “good” prediction. To find personality tests of
equal predictive efficiency would be considered a mark of outstanding
success. However, as you know by now, a correlation of .50 means that
only 25 per cent of the variance in the assessment can be explained by
the predictor. Another way of looking at it is that a correlation of .50
substituted in Formula (5-9) shows that the standard error of estimate
will be .87 as large as the standard deviation of the assessment. On the
face of it, this seems like rather poor prediction and a discredit to
psychological tests.

If near perfect correlation were the standard for psychological tests,
then nearly all of them would be considered very poor. However, a test
need not correlate highly with an assessment to perform the job that is
needed in many situations. How this is done can be demonstrated with a
typical personnel-selection situation. The number of people who apply for
a job is usually larger than the number who are selected. The ratio of
these is referred to as the selection ratio:


$\text{selection ratio} = \dfrac{\text{number selected}}{\text{number who apply}}$

If it were necessary to hire all of the people who applied, the selection
ratio would be 1.00 and there would be no room for the use of psycho-
logical tests. Of the people who are selected for a job, some will succeed
and some will not succeed. The unsuccessful employees may be released,
shifted to other jobs, or remain on the job at an unsatisfactory level of
performance. The number who succeed provides another important ratio:

$\text{success ratio} = \dfrac{\text{number who succeed}}{\text{number selected}}$


To illustrate the importance of the selection ratio and the success ratio, 
consider a hypothetical situation in which 100 persons are selected at 
random from the group that applies for a job. Imagine that a predictor 
test has been given to the entire group that applied, which, of course, 
could not have affected the random selection of persons. Suppose that the 
selected group is placed on the job and only 54 per cent of the people 
prove to be successful enough to continue. Later it is found that the pre- 
dictor test correlates well with job performance scores. This prediction 
situation is illustrated in Figure 5-8. 

Instead of using dots, numbers of subjects are recorded in the scatter
diagram in Figure 5-8. Because 100 persons are shown, the numbers can
also be interpreted as per cents. As would be expected from the mean-
ing of the correlation coefficient, smaller and smaller per cents of fail-
ure are found as higher scores are made on the predictor test. None
of the people who make a score of 20 or below succeeds on the job. Of
the people who score 50 on the predictor test, 50 per cent succeed. None
of the people who score as high as 70 on the predictor fails on the job.

The information in Figure 5-8 tells us that it would be foolish to go
on selecting in a random manner, that the test can be used to improve the
selection of employees. Just how successful the test can be in selecting the
applicants is a function of the selection ratio. In this validation problem,
25 per cent of the subjects made scores of 70 or higher, and all of those
succeed on the job. Because the group was randomly selected, it might be
expected that, on the average, an equal per cent of all new applicants
would make scores in that range. Suppose that 100 persons are needed
for the job exemplified in Figure 5-8. Then, in order to select a group in
which no failures are likely to be found, the number applying would
have to be 400.
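
The arithmetic behind the figure of 400 applicants can be sketched in a few lines of Python. This is only an illustration under the assumptions of the example: the function names are hypothetical, and it is assumed that new applicants score 70 or higher at the same 25 per cent rate as the validation group.

```python
import math

def selection_ratio(n_selected, n_applicants):
    """Number selected divided by number who apply."""
    return n_selected / n_applicants

def success_ratio(n_succeed, n_selected):
    """Number who succeed divided by number selected."""
    return n_succeed / n_selected

def applicants_needed(n_positions, share_above_cutoff):
    """Applicants required so that enough of them clear the cutoff score.

    Assumes new applicants clear the cutoff at the same rate as the
    validation group (an assumption of the example, not a guarantee).
    """
    return math.ceil(n_positions / share_above_cutoff)

# Random selection in the example: 54 of 100 selected persons succeed.
print(success_ratio(54, 100))          # 0.54

# 25 per cent score 70 or higher, and 100 persons are needed.
needed = applicants_needed(100, 0.25)
print(needed)                          # 400
print(selection_ratio(100, needed))    # 0.25
```

A smaller selection ratio lets the cutoff be placed higher on the predictor, which is why the same test becomes more useful as the pool of applicants grows.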


Figure 5-8. Relationship of predictor test scores to job success.

Usually it is not possible to find a combination of the proper correlation,
selection ratio, and success ratio to obtain persons who all will succeed.
But it is possible, often with only moderate-sized correlations, to choose a
substantial preponderance of successes (see 4).

If the success ratio is large, there is no great need for an improvement 
in selection procedures. The extent to which the test raises the success 
ratio is the measure of the test's utility in a particular situation. If the 
selection ratio is large, there is nothing that a predictor test can do to 
improve selection. One obvious way to improve the selection situa- 
tion is to obtain as many applicants as possible for the job, assuming that 
the new groups of applicants are of the same ability levels as the ones 
who have been applying. In many situations, an improvement in successes 
of only 10 per cent will save thousands of dollars ordinarily wasted in 
training persons who will not succeed, as well as sparing the persons
concerned the pains of failure.

In some selection situations, the success ratio tends to remain the same
after predictor tests have been brought into use. This is often the case in
school programs, where a certain per cent of the students are failed out
as a matter of custom. If a valid predictor test is used in selection, either
the number of failures decreases or the standards for success shift
upwards.

Tests are used as often in vocational guidance as in selection. In voca-
tional guidance there is no selection ratio to consider. Some good advice
must be given to each individual. For example, a student might consider
going into engineering on the basis of high scores on certain ability tests
and the results of an interest inventory. When it is necessary to make a
decision about each person, not just select the special group who score
very high on a predictor test, the error in prediction is considerably larger.
Even if the predictor-assessment correlation is as high as .71, the standard
error of estimate is still about .70 as large as the assessment standard
deviation. Consequently, when it is necessary to advise separate individuals,
much more confidence can usually be felt when dealing with persons from
either extreme of the test continuum. Persons who make very low scores can
safely be dissuaded from entering the particular vocation or school pro-
gram, and persons with very high scores can safely be encouraged to
follow their ambitions.
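
How large the standard error of estimate remains at a given correlation can be checked directly. The sketch below assumes that Formula (5-9) has the usual form, standard error of estimate = assessment standard deviation times the square root of (1 - r squared); the function name is hypothetical.

```python
import math

def see_fraction(r):
    """Standard error of estimate as a fraction of the assessment
    standard deviation, assuming the usual form sqrt(1 - r**2)."""
    return math.sqrt(1.0 - r ** 2)

for r in (0.50, 0.71, 0.90):
    print(f"r = {r:.2f}: standard error of estimate = "
          f"{see_fraction(r):.2f} of the assessment standard deviation")
```

The fraction shrinks slowly as the correlation rises, which is why advice to single individuals is safest at the extremes of the predictor continuum.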


Looking back at Figure 5-8, it is obvious that persons who score no
higher than 30 on the predictor test would be well advised not to apply
for that job; and persons who score as high as 60 could safely be advised
to apply. The error in advising grows as we get into the “indifferent zone”
of scores between 30 and 60. If a person has a score of 50, giving advice
on the basis of the test alone would be no better than flipping a coin.

The region of maximum error in making decisions is not always in the
middle of the predictor test continuum. If the job is very simple and, con-
sequently, most persons succeed, the “indifferent zone” lies in the lower-
score section of the predictor. If the job is so complex that only a minority
succeed, the “indifferent zone” will lie in the upper-score regions. The
farther people are on the predictor test continuum in either direction from
the point of complete “indifference,” where 50 per cent succeed and 50 per
cent fail, the safer it is to advise them or make decisions about them.
CHAPTER 6


Reliability of Measurements 


Reliability concerns the precision of measurement regardless of what 
is measured. Some random error is involved in all scientific measurements. 
For example, a metal ruler expands and contracts with changing tempera- 
tures. This introduces some error into measurements and, consequently, 
lowers the precision of measurements made with this ruler. Measurement 
error of this kind tends to obscure scientific lawfulness. For example, if 
there is a close relationship between electrical resistance and length of a 
wire, the relationship would be somewhat obscured by measuring wire 
with the unstable metal ruler. This is why precise measurements can be 
made only by holding temperature constant or by using a measuring in- 
strument that does not vary with temperature. Psychological measures 
also contain a portion of random error analogous to that of the metal 
ruler. In this chapter we will discuss some of the ways of detecting and
eliminating measurement error from tests.

A careful distinction should be made between test validity and test 
reliability. A test can have high reliability and not be valid for any par- 
ticular purpose. For example, we might use the weight of individuals to 
predict college grades. Whereas weight may be measured very precisely, 
and thus be highly reliable, weight would be invalid as a predictor of 
college grades. As will be shown in the pages ahead, the opposite is not
true. In order for a test to be highly valid, it must be highly reliable also.
High reliability is a necessary but not sufficient condition for high
validity.

Measurement Error and Predictive Efficiency. Figure 6-1 shows a hypo-
thetical relationship between a predictor test and an assessment. As would
be the case only in hypothetical circumstances, the test predicts the
assessment perfectly, and consequently, all of the scores lie on a straight
line. Let us see what happens when some error is introduced into the test
scores. What we will do is to flip a coin for each score in turn; if it turns
up “heads,” we will add three points to the score; and if it is “tails,” the
score will be left as it is. Flipping the coin for each of the test scores in
turn, we find the following:

Original score    Coin flip    New score
       0              H             3
       1              T             1
       2              H             5
       3              H             6
       4              H             7
       5              T             5
       6              T             6
       7              H            10
       8              H            11
       9              H            12
      10              T            10
      11              T            11
      12              H            15
      13              T            13

Figure 6-1. Relationship between a predictor test and an assessment before
measurement error is added.


In Figure 6-2 the new scores are plotted, with the included error com-
ponent, against the assessment. The assessment scores have, of course, not
been changed by the random additions to the predictor test scores. In
Figure 6-2 it can be seen that there is no longer a perfect correlation. The
effect of adding error to the scores is to lower the correlation. Starting off
with a perfect relationship, there is no way for the relationship to change
except to a lower correlation. The important point is that the addition of
random error will tend to lower the correlation no matter what it is origi-
nally. If we had pictured a scatter plot with a correlation of .50, the
random addition of score points would have tended to make the correla-
tion nearer zero. Likewise, if we had a correlation of -.50, the random
addition of score points would have tended to make the correlation
nearer zero.
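
The coin-flip demonstration can be checked numerically. The sketch below uses the fourteen original scores and coin flips from the table; because the assessment values plotted in Figure 6-1 are not listed, it assumes for illustration that the assessment simply equals the original test score (any perfect straight-line relationship would give the same correlations). The helper `pearson_r` is written out here rather than taken from a library.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

original = list(range(14))            # original scores 0 through 13
flips = "HTHHHTTHHHTTHT"              # coin flips from the table
new = [s + 3 if f == "H" else s for s, f in zip(original, flips)]

# Assumption: the assessment equals the original score (perfect prediction).
assessment = original[:]

r_before = pearson_r(original, assessment)
r_after = pearson_r(new, assessment)
print(round(r_before, 2))   # perfect correlation before error is added
print(round(r_after, 2))    # about .93 after the random additions
```

With only fourteen scores the drop is modest, but the direction of the change is the point: the random additions pull the correlation away from 1.00.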

Random error works to make an observed relationship less than it would 
be if the randomness were not present. This is why we speak of error, or 
unreliability, as attenuating, i.e., lessening, a correlation. With as few 
scores as those in the example above, it is possible that the random 
changes in scores would either leave a correlation unchanged or could, in
a very rare circumstance, make the correlation larger. However, the odds
are against anything but a lowering of the correlation; and if the number
of subjects is as large as it is in most testing situations (over 100 persons),
it can be predicted not only that the correlation will be lower, but with
fair accuracy, just how much it will be altered.

Figure 6-2. Relationship between a predictor test and an assessment after
measurement error is added.

A rule can be stated which will help us in discussing test reliability:
If a subgroup of individuals is randomly chosen from a total group, any
differential treatment of the subgroup which tends to make the members
all score higher or all score lower than if the differential treatment had
not been applied will tend to lower the correlation between scores for the
total group and any other variable. There are many ways in which test
scores can be randomly altered in this way. Suppose that you are develop-
ing a test to predict school grades and that the whole group which you
intend to test contains several hundred students. If some of the students
are tested when they are tired, say after a long bus trip, and others are
tested after a good night's rest, the scores of the tired group will tend
to be lower. If some of the students are tested in a quiet, well-lighted
room and others are tested in a gloomy room adjacent to noisy surroundings,
the grades of the students in the less comfortable situation will tend to be
lower. If the test instructions are given to one subgroup in a complete and
clear manner, and if the test instructions are given to another subgroup in
an incomplete and confusing manner, the grades of the latter subgroup
will tend to be lower. If in each of these circumstances it can be thought
that the students fall into the subgroups on a near random basis, the
resulting influences on test scores will cause unreliability and will lower
the ability of the test to predict school grades.


THE THEORY OF MEASUREMENT ERROR 


It is easy to introduce error into test scores purposely, as was done in
the example above, and to determine the effect it will have. However, in
nearly all practical situations, error is not purposely introduced into scores,
and the problem instead is to try to determine the amount of error which
has occurred in spite of the tester’s best efforts.

Samples of Tests. In a number of places so far in this book, we have
had to think of the content of a test, particularly of an assessment, as
being a sample from a larger collection, or universe, from which items
could conceivably be drawn. It was said that a sampling logic for test
content is not as easy to maintain as is a sampling logic for universes, or
populations, of persons. However, here again it will be necessary to dis-
cuss content sampling to provide measures of reliability.

Any test can be thought of as one test in a universe of possible tests. If
an achievement examination is being constructed, say to measure prog-
ress in spelling, many different tests of the same length could conceivably
be made. Similarly with a predictor test, the individual form can be
thought of as only one from a universe of tests which could be composed.
Whether the test is an assessment or a predictor, the purpose is to esti-
mate the score that the individual would make if he were given the entire
universe of tests. If the score that an individual makes on one sample test
is markedly different from that which he would make on another sample
test, this would mean that either one or both tests are unreliable.

The Estimation of “Universe” Scores. The problem is to estimate an
individual’s score on a hypothetical universe of tests from the score which
he makes on a sample test. (To keep in step with accepted terminology,
universe scores will be referred to as “true” scores.) The most accurate
way to estimate a “true” score would be to give the individual a large
number of tests, say, 100 spelling tests over a period of several months,
and then average the scores. It is expected that the errors would average
out over time and over the different situations in which tests are adminis-
tered. The individual who is lucky on one day and manages to guess
some of the items would probably be unlucky on another day. The indi-
vidual who is in either a poor mood or poor physical condition on one day
would probably be more fit on another day. Similarly for the other sources
of error variance, the errors would tend to average out. Consequently, the
average score over the 100 test administrations would tend to correspond
closely to an individual’s “true” score.
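
The averaging argument can be illustrated with a small simulation. This is a sketch under assumptions that are not in the text: errors are taken to be independent and normally distributed, and the "true" score of 50 and error standard deviation of 5 are invented for the illustration. The function name is hypothetical.

```python
import random
import statistics

def simulate_average(true_score, error_sd, n_tests, seed=0):
    """Return (first single score, average over n_tests administrations).

    Assumption: each obtained score is the "true" score plus an
    independent, normally distributed error with mean zero.
    """
    rng = random.Random(seed)
    scores = [true_score + rng.gauss(0.0, error_sd) for _ in range(n_tests)]
    return scores[0], statistics.mean(scores)

single, average = simulate_average(true_score=50.0, error_sd=5.0, n_tests=100)
print(f"one administration:         {single:.1f}")
print(f"average of 100 administrations: {average:.1f}")
```

Under these assumptions the error in the average is about one-tenth the error in a single score, since the standard error of a mean shrinks with the square root of the number of administrations.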

Although measurement error would be considerably reduced if a large 
number of similar test forms could be used, this would be a thoroughly 
impractical procedure. In order to obtain estimates of “true” scores with- 
out this labor, some assumptions must be made. If sample tests all con- 
taining the same number of items were administered to a group of per- 
sons, the assumptions would be these: 

1. All sample tests have the same standard deviation. 

2. All sample tests correlate the same with one another (it does not 
matter what the correlation is). 

3. All sample tests correlate the same with the universe scores (it does
not matter what the correlation is).

These three assumptions are all that are required to deduce
mathematically the formulas for determining the influence of measurement
errors on test scores. This mathematical development is given in
Appendix 4 for the reader who is interested.

The Reliability Coefficient. A useful measure, if it could be obtained
directly, would be the correlation of the scores from a sample test with
the universe or “true” scores. This would be symbolized as

$r_{1u}$

where r stands for the correlation coefficient
subscript 1 stands for test 1
subscript u stands for universe or “true” scores

If the assumptions above are valid, then $r_{1u}$ can be determined from the
correlation of any two of the sample tests. (Each of the sample tests could
be labeled 1a, 1b, 1c, and so on. The correlation between any two of these
tests will be symbolized as $r_{11}$, and we will keep it in mind that the
letters a and b are omitted to simplify subscripts.) Then it is proved in
Appendix 4 that

$r_{1u} = \sqrt{r_{11}}$    (6-1)

The correlation of two sample tests is called the reliability coefficient.
Because its square root is an estimate of the correlation of either of the
sample tests with the "true" scores, the reliability coefficient is said to be
the per cent of non-error variance (this would be the same as the square
of $r_{1u}$, and you will recall that the square of a correlation coefficient
is the per cent of “variance explained”).

The accuracy of Formula (6-1) depends on how well the assumptions
above hold in particular testing situations; and because we know that
they will not hold precisely, we must question the accuracy with which
$r_{1u}$ is estimated in practical circumstances. First we will go on to other
measurement error statistics and return in the latter part of this chapter
to consider the practical problems in the estimation of reliability.

The scores that people actually make on a test can be used to estimate
their universe or “true” score as follows:

$\hat{x}_u = r_{11} x_1$    (6-2)

where $\hat{x}_u$ is the individual's estimated "true" deviation score
$x_1$ is the individual's deviation score on test 1
$r_{11}$ is the correlation between any two sample tests from a content
universe

If an individual has a test deviation score of 10 and the reliability of the
test is .90, the best estimate of his “true” score is 9. If another individual
has a test deviation score of -20, his estimated "true" score is -18. The
best estimate of the standard deviation of "true" scores is then


$\sigma_u = \sigma_1 \sqrt{r_{11}}$    (6-3)

where $\sigma_1$ is the standard deviation of test 1
$r_{11}$ is the reliability coefficient
$\sigma_u$ is the estimated standard deviation of “true” scores
Unless the test is perfectly reliable, the standard deviation of “true” scores
will be less than the standard deviation of the sample, or obtained, scores.
"True" scores are of more theoretical than practical interest. People will
retain their same relative score positions in respect to one another on the
"true" score continuum as they have on the test actually being used.

Standard Error of Measurement. Formula (6-2) provides only an esti-
mate of “true” scores. The precision of the estimate is dependent on how
well the reliability coefficient is determined. Some methods of estimating
the reliability coefficient will be discussed in the following sections. Even
if the reliability coefficient is determined in the most rigorous manner,
there is still an element of error in the estimation of “true” scores.

Looking back at Formula (6-2), it can be seen that “true” scores are
estimated from the reliability coefficient, which is a special kind of cor-
relation coefficient. In Chapter 5 it was shown that the correlation coeffi-
cient not only provides a way of estimating one set of scores from another,
it indicates the amount of error involved in the estimation: the standard
error of estimate. The reliability coefficient can be used in a similar way
to estimate the error dispersion of sample scores about “true” scores:


$\sigma_{meas} = \sigma_1 \sqrt{1 - r_{11}}$    (6-4)

where $\sigma_{meas}$ is the standard error of measurement
$\sigma_1$ is the standard deviation of test 1
$r_{11}$ is the reliability coefficient

Using the example above of a test with a standard deviation of 10 and
a reliability coefficient of .90, the standard error of measurement would
be obtained as follows:

$\sigma_{meas} = 10\sqrt{1 - .90} = 10\sqrt{.10} = 10 \times .316 = 3.16$

It must be kept in mind that the standard error of measurement ranges
about the estimated “true” score for an individual, not, as is often mis-
takenly assumed, about the obtained, or sample, score.

The score that a person makes on a test should never be taken as an
exact point. First, it is necessary to recognize that measurement error
tends to push scores out in both directions from the mean. High scores
are usually overestimates of ability and low scores are underestimates.
Second, measurement error introduces a zone of uncertainty about each
estimated “true” score, the size of which is indicated by the standard
error of measurement. Consequently, an individual's score should be con-
sidered as lying somewhere in a band along the score continuum. The
width of the band which is used depends on the precision with which
scores will be interpreted. If we want the odds to be only 1 in 100 that
the individual's score falls outside the band, the individual can be
considered to lie in a band stretching 2.58 standard errors of
measurement (see Appendix 3) above and below his estimated “true”
score.

The use of the standard error of measurement to set confidence inter-
vals can be illustrated with the example above in which "true" scores for
two persons are estimated to be 9 and -18 respectively and the standard
error of measurement is 3.16. The .01 confidence level would then lie 8.15
(2.58 times 3.16) score units above and below each estimated “true”
score. For a person with an obtained deviation score of 10 and an esti-
mated “true” score of 9, the confidence band would stretch from .85 to
17.15. In other words, if the individual were given 100 sample tests which
met the assumptions given earlier, we would expect to find only 1 of the
100 scores either less than .85 or greater than 17.15. Because most test
scores are reported as integers, the confidence zone limits could be
rounded to the nearest whole numbers of 1 and 17. The .01 confidence
band for a person with an obtained score of -20 and an estimated “true”
score of -18 would extend from -26.15 to -9.85.

The measurement error confidence band is often indicated by stating
the individual's score as so much plus or minus 2.58 standard errors of
measurement. For example, an IQ might be reported as 120 plus or
minus 10. However, a common mistake is to space the confidence zone
symmetrically about the obtained score rather than about the estimated
"true" score. The correct procedures are either to space the confidence
zone symmetrically about the estimated “true” score or asymmetrically
about the obtained score. Illustrating the former, the scores in the exam-
ple above could be phrased as 9 plus or minus 8.15 and -18 plus or
minus 8.15. The latter approach would be to say that the scores are
10 plus 7.15 or minus 9.15 and -20 plus 10.15 or minus 6.15. Only if an
individual's obtained score is exactly at the mean and, consequently, the
deviation score is zero, would the confidence zone be symmetrical. Then
with a standard error of measurement of 3.16 in the example above, the
score would be stated as 0 plus or minus 8.15.
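
The steps of this section, estimating a "true" score with Formula (6-2), finding the standard error of measurement with Formula (6-4), and spacing a confidence band about the estimated "true" score, can be sketched together. The function names are hypothetical, and the multiplier of 2.576 assumes that errors of measurement are normally distributed, so that 1 case in 100 falls outside the band.

```python
import math

def estimated_true_score(x, r11):
    """Formula (6-2): estimated "true" deviation score."""
    return r11 * x

def standard_error_of_measurement(sd, r11):
    """Formula (6-4): sd * sqrt(1 - r11)."""
    return sd * math.sqrt(1.0 - r11)

def confidence_band(x, sd, r11, z=2.576):
    """Band centered on the estimated "true" score, not the obtained score.

    z = 2.576 is the normal deviate for a two-sided .01 level (assumption:
    normally distributed errors of measurement).
    """
    center = estimated_true_score(x, r11)
    half_width = z * standard_error_of_measurement(sd, r11)
    return center - half_width, center + half_width

sem = standard_error_of_measurement(sd=10.0, r11=0.90)
print(f"standard error of measurement: {sem:.2f}")   # about 3.16

for obtained in (10.0, -20.0, 0.0):
    low, high = confidence_band(obtained, sd=10.0, r11=0.90)
    print(f"obtained {obtained:6.1f}: band {low:7.2f} to {high:7.2f}")
```

Printing the band against the obtained score makes the asymmetry visible: for any score away from the mean, the band extends farther toward the mean than away from it.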

The Correction for Attenuation. It was said earlier that unreliability
tends to attenuate, or lower, the correlation of a test with any other
variable. The theory of measurement error (see Appendix 4) allows us to
estimate how much the unreliability influences the predictive power of a
test. For example, if we have correlated the scores on a test with a set of
assessment scores and if we have made an estimate of the test reliability,
a prediction can be made as to what the test-assessment correlation would
be if the predictor were made perfectly reliable:

$\hat{r}_{u2} = \dfrac{r_{12}}{\sqrt{r_{11}}}$    (6-5)

where $r_{11}$ is the reliability estimate for test 1
$r_{12}$ is the correlation of test 1 with the assessment, 2
$\hat{r}_{u2}$ is the estimate of how much 1 would correlate with 2 if the
test were made perfectly reliable

Formula (6-5) can be illustrated in the situation where a test correlates
.50 with an assessment and the test reliability is estimated as being .81:

$\hat{r}_{u2} = \dfrac{.50}{\sqrt{.81}} = \dfrac{.50}{.90} = .56$

We would expect then that as the test is made more and more reliable its
predictive validity will move toward .56.

The correction for attenuation is often said to estimate the “true” rela- 
tionship between a test variable and an assessment; however, because of 
the assumptions which must be made, such estimates should always be 
taken with a “grain of salt.” Where the correction for attenuation is help- 
ful is in indicating the improvement in predictive validity that might 
follow from an improvement in test reliability. In the early stages of a 
testing program, it is often necessary to compose short and relatively 
unreliable tests. The correction for attenuation provides a helpful sugges-
tion as to how well particular tests will work if they are lengthened and
more highly standardized.

Whereas it is important to consider the measurement error in predictor
tests, it is equally important to consider the measurement error in the
assessments they are meant to forecast. In many situations, the assessment
is less reliable than the predictor test. This is particularly so when the
assessment consists of impressionistic ratings by foremen and supervisors.
The reliability of the assessment can be used in the correction for attenu-
ation as follows:

$\hat{r}_{1u} = \dfrac{r_{12}}{\sqrt{r_{22}}}$    (6-6)

where $\hat{r}_{1u}$ is the correlation of a test with a completely reliable
assessment
$r_{12}$ is the correlation of a test with obtained assessment scores
$r_{22}$ is the reliability of the assessment

If a predictor test correlates .32 with an assessment, and the reliability of
the assessment is .64, the correction for attenuation would be as follows:

$\hat{r}_{1u} = \dfrac{.32}{\sqrt{.64}} = \dfrac{.32}{.80} = .40$


Whereas the correction for attenuation due to the unreliability of the 
predictor test is only an indication of the increased validity that might 
come from improving the predictor, the correction due to the unreliability 
of the assessment indicates how well the test “really” works at present. Pre- 
dictor tests often work much better than the test-assessment correlations 
indicate. If, as is often the case, assessments are relatively unreliable, the 
predictors actually do a much better job than the correlations show. It is 
then quite sensible to correct for attenuation due to the assessment unreli-
ability, as shown in Formula (6-6), to estimate the “real” test validity.

Finally, a double correction for attenuation can be made which con-
siders the unreliability in both the predictor and the assessment:

$\hat{r}_{uu} = \dfrac{r_{12}}{\sqrt{r_{11}}\,\sqrt{r_{22}}}$    (6-7)

where $\hat{r}_{uu}$ is the correlation between a perfectly reliable predictor
and a perfectly reliable assessment, and other symbols retain the
meanings given above

For example, if a test correlates .36 with an assessment, and if the reli-
abilities of the predictor and the assessment are .81 and .64 respectively,
the correction would be as follows:

$\hat{r}_{uu} = \dfrac{.36}{\sqrt{.81}\,\sqrt{.64}} = \dfrac{.36}{.90 \times .80} = .50$

The correction for attenuation can be made to estimate not only what a
correlation would be if the test were made perfectly reliable, it can be
used to estimate how much the predictive validity will rise if the reli-
ability is raised by any particular amount (see Appendix 9).
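
The three corrections for attenuation differ only in which reliability appears under the square root, so a single function can cover Formulas (6-5), (6-6), and (6-7). The sketch below is illustrative; the function name and its defaults are hypothetical, with a reliability of 1.0 standing for "no correction on that side."

```python
import math

def correct_for_attenuation(r12, r11=1.0, r22=1.0):
    """Estimated correlation if the predictor (r11) and/or the
    assessment (r22) were made perfectly reliable.

    Leaving a reliability at 1.0 applies no correction for that measure.
    """
    return r12 / (math.sqrt(r11) * math.sqrt(r22))

# Formula (6-5): unreliable predictor only.
print(round(correct_for_attenuation(0.50, r11=0.81), 2))            # 0.56

# Formula (6-6): unreliable assessment only.
print(round(correct_for_attenuation(0.32, r22=0.64), 2))            # 0.4

# Formula (6-7): double correction.
print(round(correct_for_attenuation(0.36, r11=0.81, r22=0.64), 2))  # 0.5
```

Because the corrected value is a ratio of estimates, it can exceed 1.00 when the reliabilities are underestimated, which is one reason such figures should be read with caution.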


SOURCES OF UNRELIABILITY 

It was stated that any random influence on test scores will cause unreli-
ability. There are many ways in which this can happen in practice. Some
of the most prominent sources of unreliability are discussed in the follow-
ing sections.

Poorly Standardized Instructions. If parts of the test instructions are
given orally by the tester, there is always the possibility that testers will
vary the instructions to an extent which will introduce measurement error
into the scores. The tester who gives inadequate instructions, failing, for
example, to mention that the subjects can go back later to questions which
are left blank initially, will penalize his group. If the tester is too lenient
in setting the time limit for a test, the group will make higher scores than
would have been the case with a strict adherence to time. If the test
instructions vary in any way from group to group, the results of the test
will tend to be unreliable.

When test instructions are not written out as part of the examination 
booklet, a standard set of instructions should be made for all testers to
read to their groups. In addition, all testers should adhere strictly to time 
limits and other standardized procedures. 

Errors in Scoring Tests. On multiple choice tests, the errors in scoring 
are purely mechanical. If the test is scored “by hand,” it is possible to 
accidentally score some “correct” answers as “incorrect” and vice versa. 
Also, errors can be made in counting up the number of “correct” and 
"incorrect" answers for each person. If tests are machine scored, an 
improperly functioning machine can add a considerable amount of meas- 
urement error to test scores. 

The errors in scoring an "essay" examination are more subtle in charac-
ter, but they also make for unreliability. One of the prime sources of
unreliability in scoring essay examinations comes from the different grad-
ing standards used by different instructors. If three instructors teach dif-
ferent sections of the same course, they will probably have at least slightly
different ideas about what kinds of answers merit good and bad marks. If
each instructor grades his own examinations, the grade that a person gets
is partly dependent on the instructor he happens to have.

Measurement error is also "built in" the individual instructor. If an
instructor regrades a set of essay examinations after a period of time,
and there are no identifying marks to indicate what the earlier grades
were, the grades will be somewhat different the second time. Another
source of measurement error is that the instructor often changes his
standards as he goes through a set of essay examinations. He might have
strict standards at first; but as he goes through the papers and sees that
none of them measure up, he is likely to become more lenient. Conse-
quently, the grade that a student obtains is dependent on the chance
appearance of his test in the order in which they are graded. Some ways
of improving the reliability of essay examinations will be discussed
later.

Errors Due to Testing Environment. An effort should be made to test
subjects under uniform conditions. Carried to extremes this rule becomes
impossible to follow. However, gross differences in testing environment
can be removed, such as testing one group in a poorly lighted, noisy room
and testing others in more comfortable surroundings.

Errors Due to “Guessing.” In Chapter 3, we discussed the effect of
guessing on multiple choice tests and the inequities in scoring which can
result from a failure to make corrections for guessing. The correction
formula presented there is true only on the average. It is not capa-
ble of specifying exactly how much an individual knows. Some indi-
viduals may be “lucky” enough to guess considerably more than the
expected number of items, and other individuals may guess fewer items
than would be expected on the average. Therefore, guessing on multiple
choice tests adds a certain portion of unreliability to test scores. Also,
because the individuals who know the least do the most guessing (assum-
ing that everyone tries every item), guessing tends to make lower scores
less reliable than higher scores.

A free-response test, one in which the subject supplies the correct
answer, is theoretically more reliable than a multiple choice form. How-
ever, there are few subject matters which can be cast in the free-response
form, the requirement being that there be only one word or term which
represents the correct answer. The point to consider is that multiple
choice tests become more reliable as the number of alternative answers
is increased. If the subject matter permits, a test will be more reliable if
five or six alternatives are used for each item instead of only two
or three.

Errors Due to the Sampling of Content. If the object is to estimate an
individual's score on a universe of content, the sampling of content for a
particular test will introduce some unreliability. Every student has had
the feeling that he was "lucky" on a particular test, that of the many
topics in the course, the instructor happened to ask about things which
he knew. Another student, with the same level of understanding about the
whole course material, may have been “unlucky” in that he happened not
to know about the particular questions.


As in all sampling problems, the more items there are, the less
unreliability will come from the sampling of items. Generally speaking,
longer tests are more reliable than shorter tests. In order to reduce the
unreliability due to sampling error, as many items as possible should be
used (and also the items should range broadly over the subject matter).
However, this is somewhat in opposition to reducing the unreliability due
to guessing by making as many alternatives for each item as possible. Due
to the difficulties in composing tests, and the limited amount of time
available for testing, some compromise must be reached between the need to
have both many items and many alternative answers for each item. That is
why many standard tests have sought a compromise in using about four or
five alternatives, with as many items as time will permit.


Errors in the Individual. A portion of the measurement error in most
tests is due to “chance” fluctuations in the individual which raise or
lower his score. A fly lighting on the student's test, a broken pencil, a
momentary distraction, a thought about the next school dance, a mistake
in marking the answer: many such “chance” events can influence test
scores. Although such “chance” influences tend to average out over a
long, well-standardized test, there always remains a component of
unreliability due to errors in the individual.

Errors Due to Instability of Scores. In most testing situations, an indi- 
vidual's score is expected not only to represent his immediate standing 
but his standing for some time to come. The stability which is expected 
of scores is relative to the type of test. If students are given the same 
examination or a similar examination several weeks after the first testing, 
it is expected that the two sets of scores will correlate highly. Scores on 
intelligence tests are expected to remain relatively stable over long 
periods of time. If a child's IQ goes up or down by as much as 10 points 
over a period of two years, it makes the test very difficult to use. On the 
other hand, there are measures which are expected to change in relatively 
short periods of time. If an experiment is being conducted to alter the 
attitudes of a group of students towards some political issue, it is expected 
that a test of attitudes given after the experiment will show differences 
from a test given before the experiment. Whenever a psychological process 
is being studied or when individuals participate in an experiment, it is 
expected that changes of some kind will occur. If measures remain abso- 
lutely stable during the study, then there are no research findings to 
report, except to say that nothing changed. 

Instability of test scores acts much like the other sources of error to 
reduce the reliability. If intelligence test scores obtained in the fourth 
grade are used to place children in special classes in the eighth grade, 
and if the scores have fluctuated markedly during that time, the test 
scores will do a poor job of classifying the children. If a child is being 
placed in a special class because the earlier testing showed an IQ of 125 
and if a test given four years later shows an actual IQ of only 110, the 
original test score would give a faulty picture of the child’s ability. It is 
theoretically possible that earlier scores will be more predictive than 
scores obtained at the time when decisions are made about people, but 
this is almost never found in practice. The safest assumption is that if 
scores are used over a period of time as an indication of an individual's
standing, and the scores are found to be unstable, the instability makes
for unreliability.


ESTIMATING THE RELIABILITY COEFFICIENT 

What has been said so far about the sources of measurement error and
the influence of measurement error on test scores depends on the theory
of reliability, the conditions of which were given earlier. However, the
theory is an idealization of the practical circumstances in which tests are
used, and consequently, we can only estimate the reliability coefficient
rather than determine it directly. There are a number of ways of estimating
the reliability coefficient, each of which has its own advantages and
disadvantages.

Retest Method. The simplest approach to determining the reliability 
coefficient is to give the same test on two occasions. The correlation 
between the two sets of scores will be an estimate of the reliability 
coefficient, r11, as it was denoted earlier. 
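
The retest estimate can be illustrated with a short computation. The following is only a sketch: the scores are invented for illustration, and the function name pearson_r is an assumption, not something defined in this book.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for the same eight students on two occasions.
first_testing = [52, 47, 60, 38, 55, 44, 49, 58]
second_testing = [50, 45, 62, 40, 53, 47, 46, 59]

# The retest estimate of the reliability coefficient r11.
r11 = pearson_r(first_testing, second_testing)
print(f"retest reliability estimate: {r11:.2f}")
```

The same function serves for the equivalent-form method if the second list holds scores on the alternate form.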

There are two important disadvantages in using the retest method. The 
first is that the obtained reliability coefficient will reflect little of the 
error due to the sampling of content. The second is that the individual’s 
memory of his answers to the first test administration is quite likely to 
influence the answers which he gives on the second test administration. 
Such influences should dwindle as the time between the two test admin- 
istrations is lengthened, and therefore, it is wise to wait at least several 
weeks before retesting. However, it is possible that memory will affect
the second administration to some extent even if there are several years
between testings. Memory works to make the two sets of test scores
correlate highly, and consequently, the reliability coefficient is usually
an overestimate when determined by the retest method.

Equivalent-form Method. Instead of using the same test on two occasions,
it is better to use two “equivalent” forms (also called “comparable”
forms and “parallel” forms). That is, instead of making up only one test
form, two forms can be constructed which are very much alike. If, in
constructing either an achievement test or a predictor, explicit sampling
procedures have been used (test items have been sampled from a large
collection of relevant items), an equivalent form can be obtained by the
random selection of new items. However, in the construction of most
tests, the sampling is not so exact: the items are simply collected across
a broad range of the hypothetical universe of content without an explicit
sampling. Then, to obtain an equivalent form, the test constructor must
try to make up a test that “looks” as much like the first form as
possible. Sometimes this is a fairly easy task, for example, where the
test contains spelling items, vocabulary items, or arithmetic items. But
in some of the other tests of ability, and in many of the personality
tests, it is difficult to compose equivalent forms.

When the reliability coefficient is obtained from equivalent forms, it
will manifest more of the sources of measurement error, and more
accurately, than any other method. It will contain all of the sources of
measurement error found in the retest method and, in addition, will give an
indication of the amount of error due to the sampling of content. Although
the memory of one test form may give the individual a slight advantage
on taking the equivalent form, the effect of memory on the equivalent
form reliability coefficient is slight. Were it not for the expense and the
difficulties of making up equivalent forms, this method of reliability
estimation would almost always be preferable to the retest method.

Subdivided-test Method. Instead of making up equivalent forms, a 
compromise procedure has been to obtain “part scores” for different sec- 
tions within the same test. The most popular of such procedures, referred 
to as the “split-half” method, is to give the individual one score on all of 
the even-numbered questions in the examination and another score on 
the odd-numbered items. The two halves of the same test can then be 
correlated to obtain an estimate of the reliability coefficient. A statistical 
correction then must be made to estimate the reliability of the whole test, 
not just of the half-tests (the correction is given and discussed in Appen- 
dix 5). 
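
A minimal sketch of the split-half procedure follows. It assumes that the statistical correction referred to is the Spearman-Brown formula for a test of doubled length, which is the usual choice; the response data and the function names are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_brown_double(r_half):
    """Estimated reliability of the whole test from the half-test correlation."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(responses):
    """responses: one row per person, 1 = item correct, 0 = item wrong."""
    odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_scores, even_scores)
    return r_half, spearman_brown_double(r_half)

# Hypothetical right/wrong responses of six persons to eight items.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
]

r_half, r_full = split_half_reliability(responses)
print(f"half-test correlation: {r_half:.2f}")
print(f"corrected whole-test estimate: {r_full:.2f}")
```

Because each half is only half as long as the full test, the uncorrected correlation understates the reliability of the whole test, which is why the corrected figure is the one reported.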

Practical considerations have caused the subdivided-test method to be 
used as extensively as it has. In many situations it is either too expensive 
to compose equivalent forms, or it is not possible to obtain the same sub- 
jects for a second test administration. However, there are some serious 
flaws in the use of the subdivided-test method. Although the method 
determines some of the error due to the sampling of content, it does not 
do this as well as the equivalent-form method. The subdivided test shows 
none of the error due to instability over time, because both halves of the 
test are given at the same time. For these reasons, the subdivided-test 
method usually gives an overestimate of the reliability. 

It is particularly misleading to use the subdivided-test method on a
highly “speeded” test (see 1, pp. 110-114). Such a test would be, for
example, one in which the subjects are asked to complete as many simple
arithmetic problems as possible in a short period of time. If the problems
are very simple, it is a test of how fast the individual can work. Then each
person is likely to get nearly all of the items correct as far as he goes, and
individuals will differ mainly in terms of how far along in the items they
are when time is called. This works to make the individual's scores very
much alike on the split-halves. The odd and even scores will be alike on
most of the items up to the point at which time is called, because most of
them will be correct. Also, the odd and even items will be alike beyond
that point, because they would all be incorrect. The split-half reliability
estimate obtained on a purely speeded test, such as the one illustrated
here, would be completely meaningless. Fortunately, there are very few
tests which are purely concerned with speed. Most tests have some time
limit simply as a practical way of getting the test completed, but time, as
such, might be only a trivial influence on the scores. However, we should
always be wary of split-half reliability estimates obtained on highly
speeded tests.

Item-interrelationship Methods. The individual who works in the field
of testing will encounter special methods of reliability estimation based
on homogeneity, or amount of correlation among the items within one test
(see 2, pp. 385-389). Because these methods work with
the items on one test, they provide no indication of the fluctuation in 
scores over time. What these formulas do is to provide a conservative 
estimate of the subdivided-test type of reliability. The item-interrelation- 
ship formulas are the preferred procedures when the retest method is not 
advisable and when equivalent forms are not available. 
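
One widely used formula of this general kind is coefficient alpha, which for right/wrong items is the same as Kuder-Richardson formula 20. Whether it is the exact formula given in the cited reference is an assumption; the sketch below, with invented data and a hypothetical function name, is meant only to show how an estimate can be built from the items of a single test.

```python
from statistics import pvariance

def coefficient_alpha(responses):
    """Internal-consistency estimate from one administration of one test.

    responses: one row per person, one column per item.
    """
    k = len(responses[0])
    # Variance of each item across persons.
    item_variances = [pvariance([row[j] for row in responses]) for j in range(k)]
    # Variance of the total scores across persons.
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical right/wrong responses of six persons to eight items.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
]

alpha = coefficient_alpha(responses)
print(f"coefficient alpha: {alpha:.2f}")
```

The estimate rises as the items correlate more highly with one another, and it says nothing about stability over time, since all of the items are answered at one sitting.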

Measurement of Stability. Most psychologists prefer to consider the 
measurement of stability of scores over time as a separate problem from 
the measurement of the other sources of error variance. As was said pre- 
viously, stability is expected of some instruments and not of others. Also, 
whether or not it is necessary for test scores to remain stable depends on 
the way in which the test is used. If an instrument is used over a number 
of years to make predictions, then stability of scores is essential.

It should be obvious that stability can be measured only by the retest 
and the equivalent-form techniques and not by the subdivided-test and 
item-interrelationship techniques. Stability can be measured either by 
retesting or, preferably, by administering equivalent forms over the range 
of time in which the test is used to make predictions. The measurement of
stability is of more theoretical interest than practical importance. It is
usually unsafe to use test scores obtained at one time to make predictions
or counsel individuals at a much later time. It is better to give tests at
the time when decisions must be made about individuals. The measurement
of stability is primarily important in helping us understand how human
attributes develop and change.

In using either the retest or the equivalent-form techniques for
measuring reliability, it is customary to space the test administrations
several days to a month or more apart. This is done not so much to measure
stability over time but to measure “errors in the individual.” It is
reasonable to expect day-to-day fluctuations in moods and physical
conditions to influence test scores. If the two test administrations
occurred on the same day or only several days apart, it would provide an
insufficient basis for measuring the errors due to fluctuating moods and
physical conditions. Also, if the time span between administrations is
very short, individuals will remember the first test and will tend to
repeat their work habits, guessing behavior, and characteristic mistakes
on the second test.


THE HANDLING OF MEASUREMENT ERROR IN A TESTING PROGRAM

Because of the tendency for errors of measurement to lower test validity
regardless of the sense in which validity is determined, unreliability is
“bad” in any measurement situation. No definite rule can be stated as to
how high the reliability coefficient should be for a test, but in general one
suspects a test that has a coefficient less than .80. Some of the
better-standardized instruments have reliability coefficients over .90.

It must be remembered that measurement error is important only 
because it attenuates test validity. If a predictor test has a high correlation 
with its criterion, reliability is no problem. The test constructor is con- 
cerned with measurement error when a test fails to predict its criterion. 
It is sometimes found that the low predictive power of the test is due to 
unreliability from one or more of the error sources described previously. 
By removing some of the sources of measurement error, the predictive 
power of the test can be increased. 
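
The attenuation described above can be made concrete with the classical relation between an observed correlation and the correlation that would be found with error-free measures. This relation is not stated in the present chapter; it is brought in here as a standard result of reliability theory, and the function names are illustrative only.

```python
import math

def attenuated_validity(true_r, r_xx, r_yy=1.0):
    """Correlation expected between fallible measures.

    true_r: correlation between error-free test and criterion scores.
    r_xx, r_yy: reliability coefficients of the test and the criterion.
    """
    return true_r * math.sqrt(r_xx * r_yy)

def validity_ceiling(r_xx, r_yy=1.0):
    """Largest validity coefficient possible at the given reliabilities."""
    return math.sqrt(r_xx * r_yy)

# Hypothetical case: an error-free correlation of .60 with the criterion,
# measured by a test whose reliability is only .64.
print(f"observed validity: {attenuated_validity(0.60, 0.64):.2f}")
print(f"ceiling on validity: {validity_ceiling(0.64):.2f}")
```

Read in the other direction, the same relation shows why removing sources of measurement error can raise the predictive power of a test.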

Increased Standardization. Rather than worry about unreliability after 
it has occurred, it is primarily important to remove as many sources of 
measurement error as possible when the test is being constructed. Any- 
thing that works to standardize the test instructions and procedures of 
administration, prevents errors in scoring, equates testing environments, 
and lowers the influence of guessing, will make a test more reliable. 
Errors in the sampling of content can be lowered to some extent by 
either an actual sampling of items or a careful effort to range the items 
across a defined universe of materials. However, even with the best sam- 
pling of items, there will be some sampling error due to the limited num- 
ber of items in any test. 

It is often erroneously assumed that the need for test standardization 
requires that tests should be given under the most comfortable conditions 
possible. The requirement is that all subjects should be treated in the
same way. A particular test may be most valid if the subjects are told
little about the instrument or if they are purposely confused. It may also
promote test validity if the subjects are purposely distracted and harassed
while the test is being administered.

Increasing the Number of Items. After all efforts have been made to 
increase test standardization, the reliability can be raised by making the 
test longer. The increased length will act to reduce the errors due to
“guessing,” the errors due to the sampling of content, and the errors in
the individual. If we make several reasonable assumptions about the
effect of lengthening a test, it can be predicted how much the reliability
will be raised by increasing the number of items (the formula is
presented and explained in Appendix 5).
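
The prediction can be sketched with the general Spearman-Brown formula, on the assumption that this is the formula the appendix presents. The added items are taken to be of the same kind and quality as the original ones; the function name and the figures are illustrative.

```python
def lengthened_reliability(r, k):
    """Predicted reliability when a test is made k times as long.

    r: reliability of the present test.
    k: ratio of new length to present length (k = 2 doubles the test).
    """
    return k * r / (1 + (k - 1) * r)

# Hypothetical test with a reliability of .60 at its present length.
for k in (1, 2, 3, 4):
    print(f"{k} x present length: {lengthened_reliability(0.60, k):.2f}")
```

Each further increase in length buys a smaller gain, so lengthening helps most when the present test is short or quite unreliable.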

Residual Errors. In spite of the tester's best efforts, there will remain
some measurement error, and there will be considerably more in some
tests than in others. In some of the personality inventories where the
individual is asked vague questions, the scores will inevitably be
unreliable. Similarly, nothing can be done to remove errors due to
instability over time, which are errors only in the sense that they hinder
predictions. If the trait being measured fluctuates over short periods of
time, the test will be of little use in predicting behavior. For example,
it is relatively meaningless to give interest tests to grammar school
children if the purpose is to use the results in planning future training.
The things that children say they will want to study and work at
professionally undergo marked changes in relatively short periods of time.

Interpretation of Reliability Coefficients. All commercial tests should
report complete reliability data (see 4). If the test manual either gives no
reliability information or states a reliability coefficient without saying how
it was determined, the test should be viewed with suspicion. In particular,
the manual should tell the standard deviation of scores for the group on
which the reliability estimate was made. The reliability estimate is a
correlation coefficient, and like all correlations, its size is directly
dependent on the dispersion, or standard deviation, of scores. The
reliability coefficient is directly meaningful only when the standard
deviation of test scores is the same as the standard deviation for the
group with which the test will be used. It was said in Chapter 5 that the
standard error of estimate tends to remain unchanged with changes in test
dispersion, but that the correlation coefficient tends to change with the
dispersion. Similarly, the standard error of measurement tends to remain
the same with changes in the dispersion of test scores, and the
reliability coefficient becomes larger as more diverse groups are studied.

If a test is being constructed to predict success for college applicants,
and a reliability study is being made of the test, the group used in the
reliability study should have a standard deviation of test scores which is
very nearly the same as that of the total group of applicants. If the
reliability study is made on a group of freshmen, this will not include the
individuals who were refused admission, and consequently, the standard
deviation of scores in the reliability study will be too small. The
reliability coefficient will then be an underestimate. If the standard
deviations are known both for the whole group and the subgroup which is
used in the reliability study, statistical corrections can be made (see 3,
pp. 96-99; 2, p. 392).

There have been cases where the erroneously high reliability coefficients
which have been reported in test manuals were obtained by computing the
coefficients on groups of subjects more diverse than those in which the
tests are actually to be used. This has been done most frequently in
determining the reliability of raw scores or mental ages on intelligence
tests by using groups of children ranging in age from, say, nine through
twelve years. This makes the standard deviation of scores much larger than
if a representative sample is obtained at one age level only.

If a test is to be used consistently, and especially if it is a commercially
published instrument, thorough reliability studies should be made. The
results should then be fully reported, including (1) the method used to
estimate the reliability coefficient, (2) the standard deviations of scores,
(3) the standard error of measurement, and (4) a description of the
sample of persons used in the reliability study.
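
The dependence of the reliability coefficient on dispersion, discussed above, can be sketched as follows. The adjustment is derived directly from the statement that the standard error of measurement stays the same from group to group; whether it matches the corrections in the cited references is an assumption, and the names and figures are illustrative.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement for a group with the given dispersion."""
    return sd * math.sqrt(1 - reliability)

def reliability_for_new_dispersion(r_old, sd_old, sd_new):
    """Reliability expected in a group whose standard deviation is sd_new,

    assuming the standard error of measurement is the same in both groups.
    """
    return 1 - (sd_old ** 2 / sd_new ** 2) * (1 - r_old)

# Hypothetical study: reliability .80 among freshmen (SD = 8), to be
# applied to all applicants (SD = 10).
r_applicants = reliability_for_new_dispersion(0.80, sd_old=8.0, sd_new=10.0)
print(f"estimated reliability among applicants: {r_applicants:.3f}")
```

Running the adjustment in the opposite direction, from a diverse group to a narrow one, shows how a coefficient computed across several age levels overstates the reliability at any single age level.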
CHAPTER 7


Multivariate Prediction


The use of one predictor test to forecast an assessment was discussed in
Chapter 5. In many testing programs there will be several predictor tests
used, and there are often several assessments to which the predictors are
applied. A typical situation is the selection of students for graduate
school. A battery of tests might be employed, including, say, vocabulary,
mathematical reasoning, mechanical aptitude, knowledge of current
events, and a personality test of introversion. The tests could be used to
select students for different graduate schools, such as engineering,
psychology, mathematics, journalism, physics, and others. Multivariate
prediction problems are much more complicated than the relatively simple
strategy of using only one predictor test. This chapter will discuss some
of the general principles and techniques for dealing with multiple
measurements.


Multiple Correlation. First we will discuss the situation in which
several tests are used to predict one assessment. It is generally the case
that scores from two or more tests can be combined in such a way as to
improve the predictions that can be made from one test alone. Much the
same is done in everyday life when we gather different kinds of
information before making decisions. Suppose, for example, that you are
thinking of buying a certain make of automobile. For one source of
information you might talk with a person who owns one of the automobiles
to get his reactions. Not being satisfied with the one source of
information, you might discuss the automobile with a mechanic, take a
trial drive, and consult the reports of consumers' magazines. If all of
these sources of information gave favorable indications about the
automobile, you would feel much more confident in making the purchase. In
this instance you would be using a number of sources of information to
predict future satisfaction with the particular make of automobile. In like
fashion, drawing information from a variety of tests enables us to make a
better prediction of the performance of individuals.

In order to discuss the use of a battery of tests to predict an assessment,
it will be helpful to return to the concept of “variance explained” as it
was mentioned in Chapter 5. There it was said that the squared correlation
coefficient indicates the per cent of the assessment variance which
can be “explained” by a predictor test. Partitioning variance in this way
has its most direct meaning in the mathematical statistics of prediction,
and it is sometimes difficult to see its common-sense meaning. In Chapter
5 the per cent of variance not explained was stated as

\[ \text{Per cent of variance not explained} = \frac{\sigma_{est}^2}{\sigma_y^2} \]

where $\sigma_{est}^2$ is the variance of the errors of estimate and
$\sigma_y^2$ is the variance of the assessment scores. (More properly, it
is a variance ratio, a fraction rather than a per cent; but we customarily
read ratios of variances without the decimal point, or, in other words, as
per cents.) By subtracting the per cent of variance not explained from
1.00, the per cent of variance explained is found:

\[ \text{Per cent of variance explained} = 1 - \frac{\sigma_{est}^2}{\sigma_y^2} \]

Another way of looking at the problem is that there are individual
differences in assessment scores and that the size of these individual
differences is indicated by the variance. The strategy is to use the
individual differences on the predictor test to predict individual
differences, or the variance of individual differences, in assessment
scores. The larger the per cent of variance explained, $r^2$, the more
effective the test is in predicting individual differences on the
assessment.

One way to understand the prediction problem is to think in terms of
geometric designs showing the amount of overlap of predictor tests with
the assessment. Here we will use squares to represent the variances of the
predictors and the assessment. Imagine that our purpose is to predict
college grade averages, and the first test that we employ, a vocabulary
test, correlates .50 with grade averages. This means that 25 per cent of
the assessment variance is explained by one predictor, vocabulary. The
overlapping variance can be illustrated as in Figure 7-1. The vocabulary
square has been placed over 25 per cent of the area of the grade average
square. The same picture would be obtained by laying one cardboard square
over 25 per cent of the area of another cardboard square.

FIGURE 7-1. Variance diagram showing correlation of .50 between a test and
an assessment. (Squares: X, vocabulary test; Y, grade averages.)

Variance diagrams like that in Figure 7-1 begin to show their usefulness
when we consider the application of more than one predictor test.
Suppose that in addition to the vocabulary test, an introversion test is
given to the college students. It correlates .50 with grade averages and
zero with vocabulary. In other words, introversion explains another 25
per cent of the grade average variance. A diagram showing the overlap
of both vocabulary and introversion on grade averages is shown in
Figure 7-2. The vocabulary and introversion tests together explain 50 per
cent of the variance of grade averages.


FIGURE 7-2. Variance diagram showing two tests which each correlate .50
with an assessment. The correlation between the two tests is .00.
(Squares: vocabulary test, grade averages, introversion test.)

The combined variance explained by a number of predictor tests can
be used to form a correlation coefficient. It is called the multiple
correlation, and it will be labeled with a capital R instead of with the
lower case r that is used for the bivariate, or “simple” correlation. The
symbol for multiple correlation in its complete form is
$R_{y \cdot 123 \ldots n}$. The symbol means the correlation of all the
tests 1, 2, 3, up to n with the assessment Y. The multiple correlation for
the problem in Figure 7-2 would be symbolized as $R_{y \cdot 12}$. In
Figure 7-2 the 50 per cent of the variance explained means that there is a
multiple correlation of approximately .71, obtained by finding the square
root of .50.

If a third test is found which correlates zero with both vocabulary and
introversion, the variance of the assessment which it explains can be
added to that explained by the other two. Imagine that a mathematics
test is used which correlates zero with both vocabulary and introversion
and .30 with grade averages. The three independent portions of the
assessment variance would then sum to .59. The multiple correlation of
vocabulary, introversion, and mathematics with grade averages is then
.77, the square root of .59.

If we could continue to find new tests which correlate zero with the
tests already being used and which correlate to some extent with the
assessment, the per cents of variance explained would accumulate until
the assessment is predicted perfectly. In this unusual circumstance in
which the tests all correlate zero among themselves, the multiple
correlation can be obtained very easily. The squared multiple correlation
equals the sum of the squared correlations of the predictors with the
assessment.
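
The arithmetic for uncorrelated predictors can be checked with a few lines of code. This is only a sketch of the special case just described; the function name is an assumption made for illustration.

```python
import math

def multiple_r_uncorrelated(correlations_with_assessment):
    """Multiple correlation when the predictors all correlate zero
    among themselves: the square root of the summed squared correlations."""
    variance_explained = sum(r ** 2 for r in correlations_with_assessment)
    if variance_explained > 1.0:
        raise ValueError("uncorrelated predictors cannot explain over 100 per cent")
    return math.sqrt(variance_explained)

# Vocabulary and introversion, each correlating .50 with grade averages.
print(round(multiple_r_uncorrelated([0.50, 0.50]), 2))        # 0.71

# Adding a mathematics test that correlates .30 with grade averages.
print(round(multiple_r_uncorrelated([0.50, 0.50, 0.30]), 2))  # 0.77
```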

Correlated Tests. The strategy of employing a battery of tests to
predict an assessment is more complicated than the description given so
far. If correlations among predictor tests were in fact zero, the multiple
correlation would be easy to compute and easy to understand. However,
correlations among tests are almost never zero, and in most cases, the
correlations are substantial. Then the portions of variance explained by
the different predictor tests are no longer independent and can no longer
be summed to obtain the squared multiple correlation. Going back to
the illustration in Figure 7-2, imagine that vocabulary and introversion
each correlates .50 with grade averages but, now, imagine that the
correlation between vocabulary and introversion is .50 instead of zero. The
overlapping variances are shown in Figure 7-3. In Figure 7-3 each variable
overlaps 25 per cent of the areas of the other two variables. Vocabulary
covers 25 per cent of the area of grade averages and 25 per cent of
the area of introversion. Similarly, introversion covers 25 per cent of
grade average area and 25 per cent of the vocabulary area. With the
addition of the overlap between vocabulary and introversion, Figure 7-3
cannot be used to obtain the multiple correlation directly, but the
diagram will prove useful in showing how a solution is obtained.

FIGURE 7-3. Variance diagram showing two tests which correlate .50 with an
assessment and in which the correlation between the two tests is also .50.
(Areas: grade averages, introversion, vocabulary.)

In Figure 7-3 there are three areas in which grade average is overlapped
by at least one of the predictors. The area with vertical lines is a
part of the assessment that is explained by vocabulary alone. The area
with horizontal lines is a portion of the assessment that is explained by
introversion alone. The dotted area is a portion of the variance of the
assessment that is explained in common by vocabulary and introversion.
In Figure 7-2 where there was no overlap between vocabulary and
introversion, the areas of overlap with grade averages could be added
to obtain a squared multiple correlation of .50. Because the predictors
do overlap in Figure 7-3, it would not be correct to add the per cents of
variance each has in common with grades. Instead, the correct estimate
of the variance of grades explained by the two predictors must be
somewhat less than the 50 per cent that was obtained from Figure 7-2. To the
extent that the tests correlate among themselves, the squared multiple
correlation of the tests with the assessment will be different from the
sum of the squared simple correlations of the tests with the assessment.
First we will look at some examples of multiple correlations obtained
where the correlations among predictor tests are not zero. Then, we will
go on to see how the multiple correlation is obtained in the general case.
The symbol $r_{y1}$ will stand for the correlation between the assessment and
predictor test 1. The symbol $r_{12}$ will stand for the correlation between
tests 1 and 2. Considering two tests each of which correlates .50 with an
assessment, the multiple correlation for varying degrees of relationship
between the tests can be seen as follows:

    $r_{y1}$    $r_{y2}$    $r_{12}$    $R$
     .50       .50       .00      .71
     .50       .50       .30      .62
     .50       .50       .70      .54
     .50       .50       .80      .53

In the examples above, it can be seen that the multiple correlation
dwindles as the correlation between the tests grows larger. When the
correlation between the tests is .30, the multiple correlation
drops to .62. When the correlation between the tests reaches .70, the
multiple correlation goes down to .54. From that point on, the multiple
correlation drops off very slowly, going down only 1 point as the
correlation between the two predictor tests goes from .70 to .80.
The multiple correlation would reach .50 only when the correlation
between the two predictors became 1.00. A perfect correlation would mean
a complete overlap of vocabulary and introversion. Then they
would both necessarily overlap the grade average area in exactly the
same place and the same amount. Consequently the multiple correlation
would have to be the same as the simple correlation of either predictor
with grade averages.
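
The figures in the examples can be checked with a short computation. The sketch below assumes the standard formula for the squared multiple correlation with two predictors, $R^2 = (r_{y1}^2 + r_{y2}^2 - 2 r_{y1} r_{y2} r_{12}) / (1 - r_{12}^2)$; the function name is ours and is used only for illustration.

```python
import math


def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of two predictor tests with an assessment.

    r_y1, r_y2: correlations of tests 1 and 2 with the assessment.
    r_12: correlation between the two tests (must not be +1 or -1).
    """
    if abs(r_12) >= 1.0:
        raise ValueError("r_12 must lie strictly between -1 and 1")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)


# Two tests, each correlating .50 with the assessment, at several
# degrees of correlation between the tests.
for r_12 in (0.00, 0.30, 0.70, 0.80):
    r = multiple_r_two_predictors(0.50, 0.50, r_12)
    print(f"r_12 = {r_12:.2f}   R = {r:.2f}")
```

Run as written, the loop gives multiple correlations of about .71, .62, .54, and .53, which shows how slowly R falls once the predictors are already highly correlated.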

The multiple correlation will always be at least as large as the largest 
correlation between any of the predictor tests and the assessment. That 
is, if one of the tests correlates .50 with the assessment, no matter what 
other correlations are found, the multiple correlation cannot be smaller
than .50. This is another way of saying that when using multiple correlation
the addition of tests to a battery cannot lower the predictive
efficiency that is already present.
Multiple Regression. Now that some background has been obtained
on multiple correlation, the technical aspects of the problem can be
discussed. The use of the correlation coefficient was discussed in Chapter 5
in respect to the regression equation, or the equation for the best-fit line.
The standard score form of the regression equation was given as

$z'_y = r z_x$

When there is more than one variable to be used in predicting an
assessment, the regression equation can be extended as follows:

$z'_y = \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_3 + \cdots + \beta_k z_k$    (7-1)

where $z'_y$ is an estimated standard score on an assessment (college
         grades)
      $z_1$ is the standard score actually obtained on the first predictor
         (say, vocabulary)
      $\beta_1$ is the weight for the first predictor

What the above equation says is that, for any number $k$ of tests, there
are weights by which to multiply the standard scores of the predictors
such that the composite scores will give the best estimate of the
assessment. The least-squares criterion is used as the standard for a "best
estimate." The problem is then to find the $\beta$s, called regression weights, or
beta weights. The answer is relatively simple when the problem entails
only two predictors:

$\beta_1 = \frac{r_{y1} - r_{y2} r_{12}}{1 - r_{12}^2}$        $\beta_2 = \frac{r_{y2} - r_{y1} r_{12}}{1 - r_{12}^2}$    (7-2)

The formulas can be illustrated with the following correlations:

$r_{y1} = .50$    $r_{y2} = .60$    $r_{12} = .30$

$\beta_1 = \frac{.50 - (.60 \times .30)}{1 - (.30)^2} = .352$

$\beta_2 = \frac{.60 - (.50 \times .30)}{1 - (.30)^2} = .495$

The obtained beta weights can be substituted into the multiple regression
equation to predict assessment scores. For an individual with standard
scores of 1.00 and .50 on tests 1 and 2 respectively, the prediction would
be as follows:

$z'_y = .352(1.00) + .495(.50)$
$z'_y = .60$

The prediction is that this person will make a standard score on the
assessment of .60.

The beta weights can be obtained for any number of predictor tests.
However, the procedure for obtaining the weights grows increasingly
laborious after more than two predictors are used. The procedures for
obtaining beta weights for more than two predictors can be found in
most advanced texts on statistics (see 2, 4, 7).

After the beta weights are obtained, they in turn can be used to determine
the multiple correlation:

$R^2 = \beta_1 r_{y1} + \beta_2 r_{y2} + \cdots + \beta_k r_{yk}$    (7-3)

The squared multiple correlation is found by multiplying each predictor-
assessment correlation by the beta weight for the predictor and summing
over the number of predictors being used. The beta weights and correlations
in the example above can be used to illustrate the computations:

$R^2 = .352(.50) + .495(.60)$
$R^2 = .473$
$R = .69$

What the multiple regression equation does is to transform a multivariate
comparison to a bivariate comparison. That is, at the start of the
problem each person has a number of predictor test scores. The multiple
regression equation combines the separate predictor scores into one score
($z'_y$). The question is then, "How well do the $z'_y$ scores correlate with
the actual assessment scores ($z_y$)?" The answer is given by the coefficient
of multiple correlation, Formula (7-3).
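
The two-predictor computations of this section can be sketched in a few lines. The function names below are ours, chosen only for illustration; the arithmetic follows the two-predictor beta weight formulas, the standard-score regression equation, and the rule that the squared multiple correlation is the sum of each beta weight times its predictor-assessment correlation.

```python
import math


def beta_weights(r_y1, r_y2, r_12):
    """Beta weights for two predictors, from the three correlations."""
    denominator = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denominator
    beta_2 = (r_y2 - r_y1 * r_12) / denominator
    return beta_1, beta_2


def predicted_standard_score(betas, z_scores):
    """Weighted sum of the predictor standard scores."""
    return sum(beta * z for beta, z in zip(betas, z_scores))


def squared_multiple_r(betas, validities):
    """Sum of each beta weight times its predictor-assessment correlation."""
    return sum(beta * r for beta, r in zip(betas, validities))


# Correlations from the worked example: r_y1 = .50, r_y2 = .60, r_12 = .30.
betas = beta_weights(0.50, 0.60, 0.30)
z_hat = predicted_standard_score(betas, (1.00, 0.50))
r_squared = squared_multiple_r(betas, (0.50, 0.60))

print(f"beta_1 = {betas[0]:.3f}, beta_2 = {betas[1]:.3f}")
print(f"predicted standard score = {z_hat:.2f}")
print(f"R = {math.sqrt(r_squared):.2f}")
```

With these correlations the first weight comes out near .35, the second near .49, the predicted standard score near .60, and the multiple correlation near .69.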

The Standard Error of Estimate in Multiple Regression. In Chapter 5 
the standard error of estimate was described in the bivariate prediction
problem. A similar measure can be derived from a multiple correlation:

$\sigma_{est} = \sigma_y \sqrt{1 - R^2}$    (7-4)

The only difference between the formula above and that used in the
bivariate regression problem is that the multiple correlation, R, is
substituted for the simple correlation, r. The standard error of estimate obtained
with the multiple correlation has the same properties as that obtained
in the bivariate problem. It can be used to set confidence intervals for
predicted scores. The standard error of estimate with a multiple correlation
of .69 would be:

$\sigma_{est} = \sigma_y \sqrt{1 - (.69)^2} = .72\,\sigma_y$

If predictions are being made of standard scores on the assessment, $\sigma_y$
equals 1.00; and consequently, the standard error of estimate is .72. If we
were predicting either raw or deviation scores whose standard deviation
is 2.0, the standard error of estimate would be 1.44.
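
As a check on the arithmetic, the standard error of estimate can be computed directly from the assessment's standard deviation and the multiple correlation. This is a minimal sketch; the function name is ours.

```python
import math


def standard_error_of_estimate(sd_y, multiple_r):
    """Standard deviation of the assessment times sqrt(1 - R squared)."""
    if not 0.0 <= multiple_r <= 1.0:
        raise ValueError("multiple_r must lie between 0 and 1")
    return sd_y * math.sqrt(1 - multiple_r ** 2)


# Standard scores on the assessment (standard deviation of 1.00).
print(round(standard_error_of_estimate(1.0, 0.69), 2))

# Raw or deviation scores with a standard deviation of 2.0.
print(round(standard_error_of_estimate(2.0, 0.69), 2))
```

Carried to more decimal places, the second value is about 1.45; doubling the already rounded .72 gives 1.44.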

Sampling Error of the Multiple Correlation. In Chapter 5 we discussed 
the sampling error problems in using simple correlations. The two
important considerations are: (1) determining whether a particular 
correlation is significantly different from zero correlation, and (2) deter- 
mining whether two correlations are significantly different from each 
other. These two problems also occur in the use of multiple correlation. 
The first type of problem can be illustrated with two tests that have no 
predictive validity for a particular assessment: the population values of 
the correlations between the tests and the assessment are zero. In any 
particular sample of persons from the population, the obtained correla- 
tions would probably not be exactly zero. More to be expected is that
the correlations obtained from successive samples would range about 
zero, with a dispersion dependent on the sample size. 

If the multiple correlation formula is applied to the chance coefficients
obtained in our hypothetical problem, it will make the most of the
chance relationships. It was stated previously that the multiple correlation
will never be smaller than the largest simple correlation.
In nearly all cases, the multiple correlation will be larger than any
of the simple correlations between the tests and the assessment, if
by only 1 or 2 points. The multiple correlation tends to "take advantage
of chance" to obtain a maximized coefficient. For this reason, special
formulas must be used to determine the significance of multiple correlations
(see 2, p. 399; 4, pp. 276-280). A discussion of these formulas
would be too technical for inclusion in this book, but they should be
studied when need arises.

The second problem, that of determining the significance of the difference
between two multiple correlations, is a very real, practical problem
in testing. This chapter began by considering the likelihood that adding
tests to a battery will improve the prediction of an assessment. Suppose,
for example, that we are using a vocabulary test and a mathematical
reasoning test to predict how well students will perform in psychology
training. We might then want to add new tests to improve the power
of prediction. For this purpose a personality test might be used. Imagine
that the multiple correlation with the grades in psychology training is
.50 before the personality test is added and .58 after the personality test
is added. Before we can consider the practical importance of the apparent
gain in prediction, it must be determined whether or not the
difference could likely be due to chance. A statistical check should be
made to provide some assurance that the new test actually improves the
prediction of the assessment.

Because of the tendency for multiple regression analysis to "take advantage
of chance," the beta weights obtained in a sample will produce
a higher multiple correlation in that sample than they will when applied
to a new sample of persons. This is the same as saying that the multiple
correlation tends to "shrink" when the beta weights are applied to a
new group of persons. In the example above, the beta weights applied
to the same sample on which they were derived produced a multiple
correlation of .69. If the beta weights are now applied to a new group of
people, the predicted assessment scores will probably not correlate .69
with actual assessment scores. The odds are that using the beta weights
obtained from the first sample will lead to a lower multiple correlation in
the second sample. The shrinkage is small if the first sample is large, say,
over 300 persons; but if the beta weights are determined originally on
as few as 50 persons, the shrinkage is often substantial (see 4, pp. 185-
186). The amount of shrinkage is also related to the number of predictor
tests being used. The more tests there are, the more opportunity there is
to "take advantage of chance," and consequently, the more shrinkage
to be expected. The proper way to determine the predictive efficiency of
a set of beta weights is to "cross validate." That is, the beta weights
should be determined on one sample of persons and then tried out on a
second sample. The correlation between the predicted scores in the
second sample and the actual assessment scores is the best estimate of
the predictive efficiency of the battery.
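
Shrinkage and cross validation can be illustrated with a small simulation. Everything in the sketch below is hypothetical: the population (two independent tests with modest validity), the sample size of 30, the number of replications, and all function names are our own assumptions for illustration, not data from any study. Beta weights are derived on one sample with the two-predictor formulas and then tried out on a second sample from the same population.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def standardize(xs):
    """Convert raw scores to standard scores within the sample."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]


def draw_sample(rng, n):
    """Hypothetical population: two independent tests, modest validity."""
    test_1 = [rng.gauss(0, 1) for _ in range(n)]
    test_2 = [rng.gauss(0, 1) for _ in range(n)]
    grades = [0.5 * a + 0.3 * b + rng.gauss(0, 1)
              for a, b in zip(test_1, test_2)]
    return standardize(test_1), standardize(test_2), standardize(grades)


def derive_beta_weights(z_1, z_2, z_y):
    """Two-predictor beta weights computed from sample correlations."""
    r_y1, r_y2 = pearson_r(z_y, z_1), pearson_r(z_y, z_2)
    r_12 = pearson_r(z_1, z_2)
    denominator = 1 - r_12 ** 2
    return ((r_y1 - r_y2 * r_12) / denominator,
            (r_y2 - r_y1 * r_12) / denominator)


def validity_of_weights(betas, z_1, z_2, z_y):
    """Correlation of the weighted composite with the actual assessment."""
    predicted = [betas[0] * a + betas[1] * b for a, b in zip(z_1, z_2)]
    return pearson_r(predicted, z_y)


def mean_shrinkage(n=30, replications=500, seed=1):
    """Average drop in correlation from derivation sample to new sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        first, second = draw_sample(rng, n), draw_sample(rng, n)
        betas = derive_beta_weights(*first)
        total += (validity_of_weights(betas, *first)
                  - validity_of_weights(betas, *second))
    return total / replications


print(f"average shrinkage with n = 30: {mean_shrinkage():.3f}")
```

In any single pair of samples the cross-validated correlation may happen to be the higher of the two; it is the average over many replications that shows the shrinkage. Rerunning with a larger `n` should show a smaller average drop.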


SELECTION OF TESTS FOR A BATTERY 


When only one test is to be used in predicting a particular assessment, 
there is little difficulty in deciding how to choose the test. The test that 
correlates the most with the assessment would be chosen. When a battery 
of tests is being formed, the matter of choosing tests is not so simple. The 
strategy is to select the fewest tests which will give the largest multiple 
correlation with the assessment. This may make for some apparently
unusual choices in some situations. To picture one such situation, consider
the following multiple correlation problem for two tests and one
assessment:

$r_{y1} = .00$
$r_{y2} = .60$        $R = .69$
$r_{12} = .50$

The first thing that strikes the eye is that test 1 correlates zero with the
assessment. If this test were being considered separately as a predictor of the
assessment, it would be of no use. However, the multiple correlation
demonstrates that test 1 can be used to improve the prediction that is
obtained from test 2 alone. A test which correlates low with the assessment
and high with one of the other predictor tests is called a suppressor
test. That is, it works to suppress, or cancel out, a part of the variance of
the second test in such a way as to improve the multivariate prediction.
In the example above the beta weights for tests 1 and 2 would be -.4
and .8 respectively. The regression equation would then be:

$z'_y = -.4 z_1 + .8 z_2$

You can see how in this way test 1 works to suppress, or cancel out, the
invalid variance in test 2.
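
The suppressor example can be verified numerically. The sketch below applies the two-predictor beta weight formulas to the three correlations of the example and compares the multiple correlation with what test 2 achieves alone; the function name is ours.

```python
import math


def beta_weights(r_y1, r_y2, r_12):
    """Beta weights for two predictors, from the three correlations."""
    denominator = 1 - r_12 ** 2
    return ((r_y1 - r_y2 * r_12) / denominator,
            (r_y2 - r_y1 * r_12) / denominator)


# Test 1 correlates zero with the assessment but .50 with test 2.
r_y1, r_y2, r_12 = 0.00, 0.60, 0.50
beta_1, beta_2 = beta_weights(r_y1, r_y2, r_12)
multiple_r = math.sqrt(beta_1 * r_y1 + beta_2 * r_y2)

print(f"beta_1 = {beta_1:.2f}, beta_2 = {beta_2:.2f}")
print(f"test 2 alone: r = {r_y2:.2f}; both tests: R = {multiple_r:.2f}")
```

The negative weight on test 1 is what does the suppressing: although test 1 predicts nothing by itself, including it raises the correlation from .60 to roughly .69.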

The example of the suppressor test is given to show that it is easy to
be fooled about how useful a new test will be until a statistical analysis
is undertaken. The complexity of intercorrelations among predictors and
the assessment makes it difficult to judge the utility of a particular test
without first working out the multiple correlation.

Although the purpose of this chapter is to show the value of multiple
measurements, it should be said that multiple regression analysis is not
a panacea for the person who works with tests. The usual experience is
that the multiple correlation grows only very slowly as the battery of tests 
increases from three, to four, to five, and so on. A substantial increase in 
predictive efficiency is often obtained from adding a second test to an 
initially good predictor, and in many cases a third test will prove valuable.
However, in only a few cases has it been found that using more than
four or five tests adds significantly to the predictive efficiency of the
battery.

The apparent limit to the gain from adding new tests is not an intrinsic
characteristic of the prediction problem. It is due to the fact that
psychological tests tend to duplicate each other's content so much that it is
difficult to find more than four or five truly different kinds of measures.
There is always the hope and the possibility that a new type of test
will add to the predictive power of existing test batteries.

There is usually a practical limit to the size of a test battery. The
question is whether or not the gain in predictive power obtained from
using a new test is worth the expense and effort. For example, if a test
raises the multiple correlation of a battery from .50 to .55, it can be asked
whether this gain is sufficient ground for including the test in the battery.
The answer to that would depend on the testing situation. For example,
if the expense of training a person in a particular occupation is large,
as it would be in most graduate school programs or as it would be in
training a pilot in the Air Force, then a gain of 5 points in predictive
power could well justify the use of the additional test.


DIFFERENTIAL PREDICTION 


Earlier in this chapter it was said that the testing situation may entail
a number of assessments to be predicted, as was illustrated with the
selection of students for various graduate training programs. Suppose
that only one test could be used to select persons for the different graduate
schools. The test would then have a separate correlation with the
grades in each of the graduate training programs. It might be found that
the test is better able to predict the grades of students in psychology
than in physics. If the test correlates positively with all of the assessments,
even though it correlates higher with some than others, then the
higher the individual's score on the test, the more likely he is to succeed
in any of the programs. An over-all index of aptitude for graduate school
would be of some use but would be little help in making decisions about
which students should be allowed to enter particular programs.

Generally we would find a battery of tests, instead of a single test,
being used to predict a group of assessments. For example, we might use
a vocabulary test, a reading comprehension test, a mathematical reasoning
test, and an interest test in a battery. A multiple correlation could
be obtained between the test battery and each assessment. With each 
multiple correlation would be obtained a set of beta weights. There 
would be a different regression equation for each assessment to be pre- 
dicted, and the beta weights would probably not be the same in the 
prediction of the different assessments. For example, the vocabulary test 
might correlate highly with grades in the history department but very 
little with grades in engineering. Then the beta weight for vocabulary 
would be different in the prediction of success in the two programs. A 
prediction could be made as to the program in which an individual would 
most likely succeed by trying out his test scores in the regression equa- 
tions for each assessment. He would then have a predicted score for each
program. The student could then be advised that, whereas he probably 
would not succeed in certain school programs, he probably would make 
the grade in others. 
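
The procedure just described, trying one person's test scores in a separate regression equation for each program, can be sketched as follows. The beta weights, the program names, and the student's standard scores are all hypothetical values invented for this illustration; in practice each set of weights would come from a multiple regression analysis for that program.

```python
# Hypothetical beta weights for each program (illustrative values only).
BETA_WEIGHTS = {
    "history": {
        "vocabulary": 0.40,
        "reading comprehension": 0.30,
        "mathematical reasoning": 0.05,
        "interest": 0.10,
    },
    "engineering": {
        "vocabulary": 0.05,
        "reading comprehension": 0.10,
        "mathematical reasoning": 0.50,
        "interest": 0.15,
    },
}


def predicted_scores(z_scores, weights_by_program):
    """Apply each program's regression equation to one person's scores."""
    return {
        program: sum(weights[test] * z_scores[test] for test in weights)
        for program, weights in weights_by_program.items()
    }


# Hypothetical standard scores for one student.
student = {
    "vocabulary": 1.5,
    "reading comprehension": 1.0,
    "mathematical reasoning": -1.0,
    "interest": 0.5,
}

predictions = predicted_scores(student, BETA_WEIGHTS)
best_program = max(predictions, key=predictions.get)

for program, z_hat in predictions.items():
    print(f"{program}: predicted standard score {z_hat:+.2f}")
print("most promising program:", best_program)
```

For this invented student the predicted standard score is about +.90 in history and about -.25 in engineering, so the advice would favor history.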

In the actual use of test batteries to predict multiple assessments, a 
number of nontest factors must be taken into consideration. If persons 
were being allocated to different training programs in the Armed Forces, 
it would be necessary to consider the time and expense which is involved. 
If the training program is relatively expensive, it would be wise not to 
accept an individual unless his predicted performance is very high. It is 
also necessary to consider the number of men needed in different special- 
ties and the number of people who can be trained at any one time. If 
there is a shortage of, say, airplane mechanics, it would be wise to lower
the standards somewhat by “gambling” on persons whose test scores fore- 
cast only moderate success. In nearly all situations the individual's own 
interests and desires are taken into account to at least some extent. 

We need differential prediction because different jobs often require 
different abilities. For example, it may be that high scores in the vocabu- 
lary test and the reading comprehension test are essential for a person 
in journalism, but it might matter little what score the person makes in 
mathematical reasoning. If we find a person who scores high in the first 
two tests, he would be a good bet for the journalism department even 
though he has a very low grade in the mathematical reasoning test. If,
instead of using differential prediction, the student's test scores are
averaged to form one general index of ability, the score in mathematical
reasoning might make the average very low. On the basis of the general
index of ability, the student might be refused admission to graduate
school, without account being taken of the fact that his pattern of abilities
makes him suited for one of the programs. Whenever there are a
number of assessments in the selection situation, the potential advantage
is with differential prediction rather than with the use of only a single
test variable.


DISCRIMINATORY ANALYSIS


A handy way to study an individual's scores on a battery of tests is to 
present the information in the form of a graph, or profile, as shown in 
Figure 7-4. Figure 7-4 shows the standard scores for two persons on five 
tests. There are a number of characteristics of profiles that prove helpful 
in testing (see 1, 5). 


Figure 7-4. Test profiles for two persons. (Tests shown: vocabulary,
mathematical reasoning, current events, mechanical aptitude, extraversion.
Horizontal axis: standard scores, from -3 to 3.)

Level. One important feature of a test profile is the level, or the extent 
to which the person does well on the tests as a whole. The level is 
defined as the mean standard score made on the tests by an individual. 
From Figure 7-4 we find that person a has a mean of .78 and person b has 
a mean of —.01. Then it is said that the level is higher for a than for b. 

The profile level provides direct information about people only if all 
of the tests correlate positively with the assessment being predicted. If 
all the tests correlate positively with the assessment, high test scores are 
"good" in the sense that they portend successful performance. If the 
profile concerns scores on tests of the ability type only, it is usually the 
case that all tests correlate positively with the assessment, even if the 
correlations are very small in some cases. Therefore, the level is usually 
a meaningful measure when all of the tests concern human abilities. 
Some cautions must be exercised in interpreting the profile level if some 
of the tests measure nonability type functions, such as interests, attitudes,
and social traits. In Figure 7-4 there is one test of the nonability type,
the test for extraversion. Consequently, adding in extraversion scores with 
the four tests of the ability type would be meaningful only if extraversion 
were important in the particular assessment being predicted. 

If all the tests correlate positively with the assessment, and if we have
no information about people except the profile level, we would choose 
the people with the highest levels. A high profile level means that the 
person scores generally high on all of the tests, even though he may 
score higher on some tests than on others. 

Scatter. Scatter concerns the extent to which an individual does better 
on some of the tests than on others. The standard deviation of the in- 
dividual's scores on the separate tests serves as an index of scatter. In 
Figure 7-4 each person has five standard scores. The scatter for each 
person is computed by substituting his five standard scores in the stand- 
ard deviation formula. Computing these indices for the two profiles in 
Figure 7-4, there is a standard deviation of scores for person a of 1.62 and 
a standard deviation of 1.08 for b. This shows that the scatter for a is 
more than that for b. An approximate measure of scatter can be obtained
by using the range of the person's scores. Person a does best on the
current events test and worst on the mechanical aptitude test. Subtracting
the lowest score from the highest, a range of 4.0 is found for person
a. In a similar way, a range of scores of 2.8 is found for person b. In
comparison to other persons with the same profile level, a person with
a wide scatter is likely to do relatively well on some jobs and relatively
poorly on others. A person who has a very small amount of scatter is likely
to do as well, or as poorly, on all of the performances being predicted.

Shape. There is one more important type of information in the profile,
which is called the profile shape. The shape concerns the ups and downs
in ability shown in the profile. The shape is defined as the order of test
scores for the individual. Person a makes his highest score on the current
events test, his next highest score on the vocabulary test, and so on to
his lowest score which is made on the mechanical aptitude test. In addition
to the differences between persons a and b in terms of level and
scatter, there are differences in profile shape. Person b makes his highest
score on the mathematical reasoning test and his lowest score on the
vocabulary test. The profile shape offers a way of fitting an individual
to a particular job. In so far as the individual does better on some tests
than others, he can be placed on a job which capitalizes on his strongest
points. The use of profiles in this way is an alternative to the multiple
regression approach.

The reader should be warned not to take the term shape to mean
literally anything about the physical appearance of the profile. The shape
refers strictly to the individual's rank-order of performance on the tests.
A naive individual studying a test profile might mistakenly impute meaning
to the "picture" represented by the profile. If the individual scored
lower on the even-numbered tests than on the odd-numbered tests, this 
would give the profile a jagged appearance. This might mistakenly be 
assumed to represent something about the individual. However, the 
ordering of tests in the profile is arbitrary. The tests can be reordered to 
make the profile “look” quite different. However, this will have no effect 
on shape, as it is defined above, or on level and scatter.
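
Level, scatter, range, and shape are all simple functions of one person's standard scores, as the sketch below shows. The five scores are hypothetical; they are chosen only to be roughly in line with the description of person a and are not read from Figure 7-4. The sketch also assumes that the standard deviation is computed with N (rather than N - 1) in the denominator. Function names are ours.

```python
import statistics

# Hypothetical standard scores for one person (not taken from Figure 7-4).
profile = {
    "vocabulary": 2.5,
    "mathematical reasoning": 0.5,
    "current events": 2.8,
    "mechanical aptitude": -1.2,
    "extraversion": -0.7,
}


def profile_level(scores):
    """Level: the mean standard score over all tests."""
    return statistics.mean(list(scores.values()))


def profile_scatter(scores):
    """Scatter: the standard deviation of the person's own scores
    (population form, an assumption of this sketch)."""
    return statistics.pstdev(list(scores.values()))


def profile_range(scores):
    """Approximate scatter: highest score minus lowest score."""
    return max(scores.values()) - min(scores.values())


def profile_shape(scores):
    """Shape: the tests in rank order, from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


print(f"level   = {profile_level(profile):.2f}")
print(f"scatter = {profile_scatter(profile):.2f}")
print(f"range   = {profile_range(profile):.1f}")
print("shape   =", " > ".join(profile_shape(profile)))
```

Because the shape is computed from the rank order of the scores alone, listing the tests in a different order in the dictionary changes none of the four results.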

The Standard Error of Difference Scores. Before practical use can be 
made of profile information, it must be determined whether differences 
in test scores within the profile are statistically significant. For example, 
it can be seen from Figure 7-4 that person a makes a standard score of 
2.5 on the vocabulary test and .5 on the mathematical reasoning test. If 
it can be shown that this is a real difference, a difference not due to the 
unreliability of the tests, the difference would be useful in predicting 
how well person a would do on particular jobs. 

In Chapter 6 we discussed the reliability of separate tests. Difference
scores also have some measurement error, or unreliability. If we gave
person a the test battery on two separate occasions, we could see if the
difference between the scores on the vocabulary test and the mathematical
reasoning test would still be 2.0 standard score units. Perhaps the
second time the difference score would be 1.5 or 2.5.

The reliability of a set of difference scores can be determined from
the separate reliabilities of the tests as follows:

$r_{\text{diff}} = \frac{r_{11} + r_{22} - 2 r_{12}}{2(1 - r_{12})}$    (7-5)

where $r_{\text{diff}}$ is the reliability of differences between standard scores on
         tests 1 and 2
      $r_{11}$ is the reliability of the first test
      $r_{22}$ is the reliability of the second test
      $r_{12}$ is the correlation between the two tests

The formula assumes that differences are obtained between sets of
standard scores.

Let the reliability of the first test be .85, the reliability of the second test
be .95, and assume that there is a correlation of .65 between the
tests. These values can be substituted into the above equation:

$r_{\text{diff}} = \frac{.85 + .95 - 2(.65)}{2(1 - .65)} = \frac{.50}{.70} = .71$

You are probably struck by the fact that the reliability of the difference
scores is less than that of either of the separate tests. This will usually
be the case when the correlation between tests is positive. It can also be
seen from inspecting the formula that the reliability of a set of difference
scores decreases as the correlation between the tests becomes larger.
Thus, if in the illustration above the correlation between the two tests 
had been .75 instead of .65, the reliability of the difference scores would 
drop to .60. It would have worked the other way had the correlation 
between tests been negative. In that case, the difference scores would 
have been more reliable than the average reliability of the tests. 

The reason for the decrease in reliability of difference scores as the 
correlation between tests grows larger is that as the correlation increases, 
the difference scores become smaller and smaller. If the correlation be- 
tween two tests is 1.00, people have the same standard scores on both 
tests, and consequently, the score differences are all zero. If the correlation
were -1.00, the differences would be as large as possible and, consequently,
as reliable as possible.

It is unsafe to interpret difference scores unless the individual tests are 
highly reliable and the correlations among tests are relatively low. In 
the extreme case where the correlation between two tests is the same 
as the average of their individual reliabilities, the reliability of difference
scores is zero. This would happen if, for example, the respective reliabilities
are .70 and .80, and the correlation between tests is .75. Fortunately,
such extreme cases are unlikely to occur in the use of tests; but
it is sometimes the case that correlations among tests are so high and
individual reliabilities are so low as to make difference scores highly
unreliable.
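
The behavior of Formula (7-5) is easy to explore numerically. The sketch below evaluates it for the three cases discussed in this section; the function name is ours.

```python
def difference_score_reliability(r_11, r_22, r_12):
    """Reliability of differences between standard scores on two tests.

    r_11, r_22: reliabilities of the two tests.
    r_12: correlation between the two tests (must be less than 1).
    """
    if r_12 >= 1.0:
        raise ValueError("r_12 must be less than 1")
    return (r_11 + r_22 - 2 * r_12) / (2 * (1 - r_12))


# Reliabilities of .85 and .95 with the tests correlating .65, then .75.
print(round(difference_score_reliability(0.85, 0.95, 0.65), 2))
print(round(difference_score_reliability(0.85, 0.95, 0.75), 2))

# Extreme case: the correlation equals the average of the reliabilities.
print(round(difference_score_reliability(0.70, 0.80, 0.75), 2))
```

The three results are about .71, .60, and zero, which traces the loss of reliability as the correlation between the tests approaches the average of their reliabilities.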

Prediction from Test Profiles. Assuming that the cautions regarding the
reliability of difference scores have been heeded, the individual's test
profile offers a basis for predicting how well he will perform in particular
situations. A useful beginning is to compute a composite profile for each
assessment to be predicted. This would be the profile of mean scores for
the persons who perform well. For example, if we study the mean test
scores of successful policemen, it might be found that they make above-
average scores in vocabulary, high scores in current events, low scores
in mathematical reasoning, and so on. The policemen could be compared
with a group of successful accountants, who might be found to make
high scores in mathematical reasoning, low scores in extraversion, high
scores in current events, and so on. For each group, the mean test scores
can be plotted as a profile, a composite profile for each job. In Figure 7-5
hypothetical profiles have been drawn for the two groups, and the profiles
for two persons, a and b, have been included to show how the composite
profiles are used in making predictions. (Persons a and b shown in Figure
7-5 are not the same as persons a and b shown in Figure 7-4.)

A most useful type of information would be the ideal profile for each
job. This would be found from the mean test scores of excellent policemen
and excellent accountants. Then, if an individual's test profile closely
resembles the ideal profile, it is a good bet that he will do relatively well
on the particular job. Although we do not have the ideal profile for each
job, we do have the composite profile of successful men. In other words,
we have the profile of satisfactory policemen but not necessarily that of
excellent policemen.
Before people can be classified for different jobs, some way is needed
of determining how similar one test profile is to a composite profile. This
might be done impressionistically by simply looking at the similarities and
differences. Looking at Figure 7-5 it can be seen that the profile for
person a is more similar to the composite profile for accountants than to
the composite profile for policemen. The profile for person b appears
more similar to the profile for policemen than to the profile for accountants.
Although in extreme cases a simple inspection of similarities and
differences would suffice to assign people to jobs, it would not be satisfactory
for less clear-cut cases. What is needed is a measure of the
similarity between test profiles.

Figure 7-5. The test profiles of two persons compared with the composite
profiles for policemen and accountants.

In order to explain a technique for measuring the similarity between
profiles, it will be helpful to give a geometric interpretation of test
scores. In Figure 7-6 the mean scores are shown for policemen and
accountants and for persons a and b on the mathematical reasoning and
extraversion tests only. We could have plotted all of the accountants'
scores on the graph instead of only their mean scores. Then
the accountants would have formed a swarm of dots extending out
around their mean location. The plot of a set of mean scores is referred
to as a group centroid. It is the point about which the scores for the
group balance in all directions. Likewise, the mean scores for the policemen
define a centroid, one that is different from that for accountants. If
each person in the group of accountants had been plotted instead of the
group means, it would have shown the density of the group, or the degree
to which the group is spread out around the centroid. It might have
been found in this way that the accountants are more densely ordered
about their centroid or more homogeneous in respect to test scores.

Figure 7-6. Graphic representation of scores on the extraversion and
mathematical reasoning tests.

Instead of plotting the scores for only two tests, a three-dimensional
model could have been constructed to show the scores for three of the
tests. This could be done with a boxlike structure in which balls are
hung to represent the scores on three tests. In this way we could obtain
a physical representation of the scores made by the two groups and by
persons a and b on the vocabulary, mathematical reasoning, and
extraversion tests. The mean group scores would then locate centroids in
three dimensions. If each of the persons in the group of successful
accountants was plotted instead of only the centroid, there would be a
scattering of the persons about the centroid in all directions. The method
that will be
used to compare test profiles and the statistics that will be developed in 
respect to the method can be employed with any number of tests. Of 
course, when the number of tests is more than three, there would be no 
way of constructing a physical model of the scores; but this will in no 
way affect the generality of the concepts and mathematical procedures 
that apply. The mean test scores for a group of persons on five, six, or 
any number of tests can still be thought of as defining a centroid in that 
number of dimensions. 
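
As a minimal sketch of the centroid idea, the following Python fragment computes a group centroid as the list of mean scores, one mean per test. The function name `centroid` and the scores are hypothetical illustrations, not data from the text; the same code works for any number of tests.

```python
from statistics import mean

def centroid(profiles):
    """Return the mean score on each test across a group of profiles.

    `profiles` is a list of equal-length score lists, one list per person.
    """
    if not profiles:
        raise ValueError("need at least one profile")
    n_tests = len(profiles[0])
    if any(len(p) != n_tests for p in profiles):
        raise ValueError("all profiles must cover the same tests")
    # zip(*profiles) regroups the scores test by test.
    return [mean(scores) for scores in zip(*profiles)]

# Hypothetical standard scores for three persons on three tests.
group = [
    [1.0, 2.0, -1.0],
    [2.0, 3.0, -2.0],
    [3.0, 4.0, 0.0],
]
print(centroid(group))  # [2.0, 3.0, -1.0]
```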

The Distance Measure, D. In Figure 7-6 it is apparent that the cen- 
troids are different for accountants and policemen. It would be helpful 
here to have a measure of how far apart they are. The statistic that has 
proved useful for this purpose is called the D measure. It could be used 
quite simply to determine the distance of the two groups from each other 
as follows: 


$$D_{ap}^2 = (M_{1a} - M_{1p})^2 + (M_{2a} - M_{2p})^2 \tag{7-6}$$


where $D_{ap}$ is the distance between the centroids of accountants, a, and
policemen, p
$M_{1a}$ is the mean score of accountants on test 1
$M_{1p}$ is the mean score of policemen on test 1
$M_{2a}$ is the mean score of accountants on test 2
$M_{2p}$ is the mean score of policemen on test 2


Figure 7-7. Graphic representation of mean differences and over-all distance
between the scores of accountants and policemen on the extraversion and
mathematical reasoning tests.

What the formula says is that the sum of the squared differences between
the means equals the square of the distance between the two centroids.
The D measure is a use of the Pythagorean theorem, which says that for a
right triangle the sum of the squares of the sides is equal to the square
of the hypotenuse. How this works is illustrated in Figure 7-7. The D
measure can be applied to the scores of individuals to show their distance
from one another and from the centroids for particular groups. If we want
to find the distance of person a from the centroid for accountants, the
mean score for accountants on mathematical reasoning is subtracted from
the score made by a and squared. This is then added to the squared
difference between the score made by person a and the mean for accountants
on extraversion. The square root of this quantity is then the distance of
person a from the centroid for accountants:


$$\begin{aligned}
D^2 &= (2.8 - 2.4)^2 + (-.2 + 1.2)^2 \\
    &= (.4)^2 + (1.0)^2 \\
    &= .16 + 1.0 \\
    &= 1.16 \\
D &= 1.08
\end{aligned}$$
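
The arithmetic above can be checked with a short Python sketch. The two squared terms (.16 and 1.0) come from the worked example, which implies differences of .4 and 1.0 on the two tests; the variable names are illustrative only.

```python
import math

# Differences between person a and the accountants' centroid,
# as implied by the worked example (.4 squared is .16, 1.0 squared is 1.0).
diff_math_reasoning = 0.4
diff_extraversion = 1.0

# Sum of squared differences, then the square root (Pythagorean theorem).
d_squared = diff_math_reasoning ** 2 + diff_extraversion ** 2
d = math.sqrt(d_squared)

print(round(d_squared, 2))  # 1.16
print(round(d, 2))          # 1.08
```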


The first step in making decisions about people with the D statistic is
to compute the distance of each person's score profile from the composite
profile for each job. The individual would then be assigned to the job
whose composite profile is closest to his own profile. This would be very
easy to do in the situation in which there are only two jobs to consider
and only two tests on which to base the decision. It would only be
necessary to find the distance of a person from the two centroids.
Considering only the mathematical reasoning and extraversion tests, person
a is 4.08 from the centroid for policemen and 1.08 from the centroid for
accountants.

There will generally be more than two tests in discriminatory analysis.
The D measure can be used with any number of tests by expanding the
formula as follows:

$$D_{ab} = \sqrt{(z_{1a} - z_{1b})^2 + (z_{2a} - z_{2b})^2 + \cdots + (z_{ka} - z_{kb})^2} \tag{7-7}$$


where $D_{ab}$ is the distance between persons a and b or the distance
between the centroids of groups a and b
$z_{1a}$ is the standard score of person a on test 1 or the mean
standard score of group a on test 1
$z_{1b}$ is the standard score of person b on test 1 or the mean
standard score of group b on test 1
$k$ is the number of tests
The formula says that the distance between two persons, or the distance
between the centroids of two groups, is determined by squaring the
differences on the separate tests, summing these, and taking the square
root of the sum. In this way it is found that over the five tests person a
has a distance of 1.50 from the centroid for accountants and a distance of
4.59 from the centroid for policemen. Using this procedure, person a would
be classified as a potential accountant rather than as a policeman. (Some
criticisms of this method of classifying people will be given later.)

The D measure could be used to discriminate more than two groups.
For example, in addition to policemen and accountants, there might
be composite profiles for successful social workers and for successful 
plumbers. For each group there will be a centroid. If the centroids are 
different, the individual can be classified in terms of the centroid which 
his score combination most nearly approaches. Theoretically it is possible 
to classify a person on the basis of any number of tests into one of any 
number of groups. In practical problems the number of groups will usu- 
ally be limited to only several, and the number of tests will not generally 
be more than a half dozen. 
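
The classification rule just described (assign the person to the group whose centroid is nearest) can be sketched in Python as follows. The function names, the three centroids, and the person's scores are hypothetical; only the distance formula itself follows (7-7).

```python
import math

def profile_distance(z_a, z_b):
    """D between two profiles of standard scores, as in formula (7-7)."""
    if len(z_a) != len(z_b):
        raise ValueError("profiles must cover the same tests")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_a, z_b)))

def classify(person, centroids):
    """Return the group whose centroid is closest to the person's profile."""
    return min(centroids, key=lambda g: profile_distance(person, centroids[g]))

# Hypothetical centroids (mean standard scores on three tests).
centroids = {
    "accountants": [1.0, 2.0, -1.0],
    "policemen": [0.0, -1.0, 1.5],
    "social workers": [1.5, 0.0, 2.0],
}
person = [1.0, 1.0, -1.0]

for group, c in centroids.items():
    print(group, round(profile_distance(person, c), 2))
print(classify(person, centroids))  # accountants
```

With these made-up numbers the person lies 1.0 from the accountants' centroid and more than 3 units from the other two, so the nearest-centroid rule places him with the accountants.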

The Linear Discriminant Function. Although the D measure provides 
an approximate solution to discriminatory analysis, it has a number of 
shortcomings. If tests correlate with one another, and substantial correla- 
tions are usually expected, the distance between persons on one test will 
tend to be perpetuated on other tests. Therefore, what appears to be a 
large distance between persons, or between groups, may be only a reflec- 
tion of the correlations among tests. For example, if two persons have 
widely different scores in vocabulary, they are also likely to have widely 
different scores in similar tests such as reading comprehension and verbal 
analogies. Even if the two persons had very similar scores in mathematical 
reasoning and extraversion, the differences due to the three verbal com- 
prehension tests would make the D large, which would give a distorted 
picture of the actual differences between the two persons. 
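
The inflation described above can be illustrated with a small Python sketch. The score differences are hypothetical: two persons are assumed to differ by two standard-score units on each of three closely related verbal tests and not at all on the other two tests.

```python
import math

def d_from_differences(diffs):
    """D computed from a list of per-test score differences."""
    return math.sqrt(sum(x ** 2 for x in diffs))

# Hypothetical differences between two persons, in standard-score units.
# Order: vocabulary, reading comprehension, verbal analogies,
#        mathematical reasoning, extraversion.
five_tests = [2.0, 2.0, 2.0, 0.0, 0.0]

# The same comparison if only one verbal test were in the battery.
one_verbal_test = [2.0, 0.0, 0.0]

print(round(d_from_differences(five_tests), 2))       # 3.46
print(round(d_from_differences(one_verbal_test), 2))  # 2.0
```

The single underlying verbal difference is counted three times, so D grows from 2.0 to about 3.46 although the two persons differ in only one respect.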

Another drawback to using the D measure as a means of classifying 
people is that it does not fully take account of the degree to which
different variables differentiate one job from another. For example, it may
be that there is a considerable dispersion of mechanical aptitude among 
successful policemen. This means that mechanical aptitude is not an 
important variable in choosing policemen. Then if an individual makes a 
very high or very low mechanical aptitude score, the D statistic will be 
affected by the difference between the individual's score and the mean 
score for policemen. The more proper approach is to weigh each test in
terms of the degree to which it discriminates one job from another.

Another factor that must be considered in discriminatory analysis is 
that of obtaining inferential statistics which will tell the probability that 
persons are correctly classified. The standard error of estimate is the 
analogous statistic used in correlational analysis. These statistics must 
consider the relative density, or scatter, of the members of each group 
about their centroid. In order to overcome the shortcoming of the D
measure, more elaborate procedures must be used. The most widely used
procedure is the linear discriminant function (see 3, 6). The more complex
procedures are modifications of the D statistic rather than basically
different approaches. Whereas the D statistic is a useful approximation,
the more complete procedures should be sought when need arises (see 6).
A COMPARISON OF REGRESSION AND DISCRIMINATORY ANALYSIS


Two different ways have been described for dealing with multiple
measurements: multiple regression and discriminatory analysis. It has not
yet been made explicit how the methods differ. The major difference is
that the multiple-regression methods assume that the ideal profile
touches only the extremes of the test continua. That is, the
multiple-regression methods assume that the “more the better” on each test
regardless of the job, or task, being predicted. The reason for this is
that the
regression methods assume linear relationships. If there is a linear rela- 
tionship between a predictor and an assessment, then the higher the score 
made on the predictor, the better the indication that the person will suc- 
ceed. This means that the ideal score is the highest that can be made on 
the test. It may be, of course, that the correlation between test and assess- 
ment is negative. This would mean that the lower the score made by the 
individual the better is his chance of succeeding. If there is a linear
relationship between each test and the assessment, the ideal profile would
touch one end of each test continuum.

When dealing with human abilities, linear relationships are usually
found between tests and assessments. However, this is not necessarily
the case. For example, it may be that very high “general intelligence”
indicates that a person will do poorly in very routine tasks such as those
of the truck driver or elevator operator. In the areas of interests,
attitudes, and personality characteristics it is more likely that some of
the relationships between predictors and assessments will be nonlinear.
A curvilinear relationship would be expected between school grades and
the amount of anxiety toward school work, as pictured in Figure 7-8.

Figure 7-8. Hypothetical relationship between amount of anxiety toward school
work and grades.

In Figure 7-8 the hypothesis is that the persons who make the highest
grades have a moderate amount of anxiety. Both the persons who show very
high anxiety and those who show very low anxiety make low grades in
school. The low anxiety students would make low grades because they are
so uninterested in accomplishment that they do not make the necessary
effort. The high anxiety students are so worried about failure that they
have no attention left for school work. This is an example of the type
of nonlinear relationship that could not be used in multiple-regression
analysis but could be used in discriminatory analysis. In discriminatory 
analysis the ideal profile point for the anxiety variable would be in the 
middle of the test continuum. 
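
Why such a relationship is lost to a linear method can be seen in a small Python sketch. The anxiety and grade values are invented to follow the inverted-U shape hypothesized for Figure 7-8, and `pearson_r` is an illustrative helper written out in full rather than taken from any library.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical anxiety scores (1 = very low, 9 = very high).
anxiety = list(range(1, 10))
# Hypothetical grades that peak at moderate anxiety (inverted U).
grades = [16 - (a - 5) ** 2 for a in anxiety]

r = pearson_r(anxiety, grades)
best = anxiety[grades.index(max(grades))]
print(abs(r) < 1e-9)  # True: the linear correlation is essentially zero
print(best)           # 5, the middle of the anxiety continuum
```

Grades depend entirely on anxiety in this made-up example, yet the linear correlation is essentially zero, while the best anxiety score sits in the middle of the continuum, where a profile-based method would place the ideal point.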

Linearity of relationships should be checked by an inspection of the 
scatter diagrams before a decision is made as to the type of analysis to 
use. The discriminatory type of analysis is more general and can use the 
information provided by some types of nonlinear relationships. However, 
regression analysis is easier for most persons to understand and compute. 
When the nonability areas of measurements are more effectively devel- 
oped, it can be expected that methods of discriminatory analysis will come 
more widely into use. 

A second feature which distinguishes the two types of analysis is the 
nature of the assessment score distribution. Multiple-regression analysis 
assumes that the assessment scores are continuously distributed, or at 
least approximately so. Discriminatory analysis assumes that the assess- 
ment is a qualitative distinction, a category rather than a continuum: all 
of the people who are successful in a particular job are considered to be 
of equal merit. 


There are advantages and disadvantages in considering the assessment
as a continuously distributed variable rather than as a category. In many
situations it makes more sense to assume that there are gradations of
ability within the category of success. Such would be the case in school
work, salesmanship, athletic ability, and job performance. However, it is
not always possible to measure the fine gradations of ability, and
sometimes only the crude index of “pass” or “fail” can be used as the
assessment. This is particularly so in the personality area, where it makes
sense to validate an adjustment inventory against the dichotomized
assessment of mental hospital patients versus nonhospital patients. To
obtain finer gradations on the assessment would be very difficult. Some
variables are inherently categorical rather than continuous. Examples of
these are men versus women, sailors versus soldiers, and plumbers versus
barbers.

Whenever the purpose of the analysis is to distinguish qualitatively
different groups or when only a dichotomized assessment can be obtained,
discriminatory rather than multiple-regression analysis should generally
be used. If continuous assessments are available and relationships
between predictor tests and assessments are reasonably linear, multiple
regression should generally be used.

In order to give any rules at all as to when multiple-regression analysis is
preferable to discriminatory analysis and vice versa, it has been necessary
to oversimplify the problems involved. Primarily it is necessary to
consider whether the test battery is used for selection, guidance, or
classification. In selection, the purpose is to choose only those people
who are likely to succeed in one or more of the positions involved. If
regressions between
tests and assessments are linear, multiple-regression analysis will provide 
an estimate of each person's performance in each assessment situation. In 
this way people can be selected who will likely succeed in at least one 
position. It is best to think of selection as an institution-oriented
procedure: the purpose is to choose the most promising persons and to reject
those persons who fail to give promise for at least one position. 

Classification and guidance are more person oriented. The purposes 
are to place people in positions in which they will do better and to advise 
them to try positions in which they will do better. However, even if an
individual is given the best placement possible, if he is relatively
untalented, he may be only moderately successful. Such an individual would
probably be rejected in a selection program. In classification and
guidance, individuals cannot be rejected: the best placement must be found
for each individual regardless of his over-all ability.¹ In guidance and
classification, the discriminatory approach has certain advantages over the 
multiple-regression approach, especially when regressions between tests 
and assessments are nonlinear. Discriminatory analysis will help put the 
individual into a position which is shared by other individuals of the 
same general interests, personality characteristics, and abilities. This is 
likely to make the individual feel comfortable in his job and comfortable 
with his work mates. 

Obviously, the institution-oriented and person-oriented philosophies 
and procedures are somewhat at odds with each other. In the long run,
it can be expected that there will be a marriage of the two philosophies 
which will incorporate the concerns of both and provide more compre- 
hensive procedures for making decisions about human beings. This is 
presently one of the frontiers of measurement methodology. 


THE MULTIPLE-CUTOFF METHOD 


A third approach to the use of multiple measurements for the selection
of personnel is to set cutoff scores on each of the tests. A person is selected
for the particular job, school, or position, only if he scores at least as
high as the cutting score on all tests. For example, a set of cutoff scores
for selecting students for electronics training might be 80 on an
intelligence test, 50 in perceptual speed, and 65 in motor dexterity. The
actual cutoff points would be determined in such a way as to raise the
probability that the selected individuals will succeed. For example, the
cutting score for each test might be set so that 75 per cent of the persons
scoring at or above the cutting score succeed on the job.

¹ [...] guidance only in the extreme case where an individual [...] to the
extent that he is advised [...] or work at all. [...] training period or
when he is assigned to a [...] intended for members of the group into which
the individual was originally selected.
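
The cutoff rule described in this section can be sketched in a few lines of Python. The three cutting scores (80, 50, 65) are those of the electronics-training example; the test labels, the applicants, and the function name `passes_all_cutoffs` are hypothetical.

```python
# Cutting scores from the electronics-training example in the text.
CUTOFFS = {"intelligence": 80, "perceptual_speed": 50, "dexterity": 65}

def passes_all_cutoffs(scores, cutoffs=CUTOFFS):
    """True only if the person is at or above the cutting score on every test."""
    return all(scores[test] >= cut for test, cut in cutoffs.items())

# Hypothetical applicants.
applicants = {
    "A": {"intelligence": 95, "perceptual_speed": 60, "dexterity": 70},
    "B": {"intelligence": 120, "perceptual_speed": 49, "dexterity": 90},
    "C": {"intelligence": 80, "perceptual_speed": 50, "dexterity": 65},
}

selected = [name for name, s in applicants.items() if passes_all_cutoffs(s)]
print(selected)  # ['A', 'C']
```

Applicant B is rejected for falling one point short on a single test despite very high scores elsewhere, while applicant C, who sits exactly at every cutting score, is accepted.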

The selection ratio determines how high the cutting scores are placed. 
If there is a large number of applicants and only a relatively small num- 
ber of these must be selected, the cutting scores can be placed high. On 
the other hand, if most of the people who apply must be selected, the 
cutting scores necessarily must be low. 

The multiple-cutoff method is particularly valuable when, as is some- 
times the case, tests are predictive only on their lower extremes. This 
tends to be the case for some of the motor, perceptual, and sensory abili- 
ties. For example, minimal levels of auditory and visual acuity are essen- 
tial for a number of jobs. However, beyond the minimum requirement, 
success on the job does not increase with higher and higher auditory and 
visual acuity. 

Multiple regression is a compensatory model in that a high score on one 
test can compensate for a low score on another test. This is a sensible 
model in some situations, such as in college training, where an individual 
can, to some extent, substitute one ability for another. However, in some 
situations a high score on one measure cannot compensate for a low score 
on another measure. For example, in the operation of sound detection 
devices in submarines, the operator must have a certain level of auditory 
ability. Otherwise he will fail, regardless of how high he might score in 
intelligence and other abilities. 

The multiple-cutoff method is preferable to the multiple-regression 
model and discriminatory analysis when the tests are predictive only on 
their lower extremes. This can be seen by looking at the scatter diagrams 
comparing test performance with job performance. If the test is predictive 
only on the lower extreme, the regression will tend to be curvilinear and 
the points will be more closely packed about the regression line on the 
lower-score section of the test. An example is shown in Figure 7-9. 

The multiple-cutoff method can be used in conjunction with other
approaches. It is usually unwise to select an individual for a position or,
in vocational guidance, to advise an individual to take a position, if he
is very low in any one of the related abilities. Consequently, it is useful
to set relatively low cutoff points for each test, and not consider an
individual for a position if he is below the cutoff point on any one of
them. Then if, in the remaining group, test scores tend to correlate with
job success in a linear manner, scores can be combined by multiple
regression. The final selection would be made on the basis of the combined
scores. If people are being selected for several jobs, the procedure can
be repeated for each job. Each job would have its minimum-cutoff scores
for weeding out the obviously unfit, and each job would have its
multiple-regression equation for the final selection.

Figure 7-9. Relationship between test scores and job performance in which a cutting
score would be useful for selecting personnel.
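
The combined procedure described above (minimum cutoffs first, then a multiple-regression composite for the survivors) might be sketched as follows. Everything numeric here is hypothetical: the applicants, the cutoffs, and especially the regression weights, which in practice would be estimated from data on job success.

```python
def two_stage_selection(applicants, min_cutoffs, weights, intercept, n_select):
    """Screen on minimum cutoffs, then rank survivors by a regression composite.

    `weights` and `intercept` stand in for a multiple-regression equation
    that would be estimated from real data; here they are hypothetical.
    """
    # Stage 1: drop anyone below the minimum cutoff on any test.
    survivors = {
        name: s for name, s in applicants.items()
        if all(s[t] >= cut for t, cut in min_cutoffs.items())
    }
    # Stage 2: predicted performance from the linear composite.
    predicted = {
        name: intercept + sum(weights[t] * s[t] for t in weights)
        for name, s in survivors.items()
    }
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    return ranked[:n_select]

# Hypothetical scores, cutoffs, and regression weights.
applicants = {
    "A": {"auditory": 40, "intelligence": 110},
    "B": {"auditory": 20, "intelligence": 140},  # fails the auditory minimum
    "C": {"auditory": 55, "intelligence": 100},
    "D": {"auditory": 35, "intelligence": 125},
}
min_cutoffs = {"auditory": 30, "intelligence": 90}
weights = {"auditory": 0.2, "intelligence": 0.5}

print(two_stage_selection(applicants, min_cutoffs, weights, 0.0, 2))  # ['D', 'A']
```

Applicant B would have had the highest composite if regression alone were used, but is removed at the first stage. This is the sense in which the cutoff stage is not compensatory.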
PRACTICAL APPLICATIONS OF TEST BATTERIES


There is a choice in many testing situations as to whether one general
test should be used or whether a number of different tests should be
administered in a battery. The potential advantage is always with
multiple measurements: more information cannot lower prediction.
Moreover, there are some circumstances in which it is especially important
to measure the individual's differential ability. One such circumstance is
vocational guidance. The individual comes for advice; and, regardless
of how high or low his over-all ability (profile level), the problem is to
find the occupations that are best suited to his pattern of abilities (profile
shape). Even the person who makes a very low score on a general
intelligence test might have certain aptitudes, such as motor dexterity and
mechanical ability, that would suit him for some jobs.

Any circumstance in which the effort is to classify persons after
preliminary selection especially requires the use of multiple predictors.
The Armed Services exemplify such situations. First, the potential
officers are tested as a [...] officer training programs. One general
measure [...] used for preliminary selection. The general measure [...]
much of what we have spoken of as level in the test profile. If the
individual has a high level, it is certain that he is high in some of the
separate abilities. After the course of training, the new officers can
then be classified, or in other words, placed on particular jobs. In the
Army some might become artillery officers, some members of the
intelligence sections, and others might enter communications activities.
It would then be important to
know the individual's differential ability in order to match his special 
talents with particular assignments. Whereas the general test is sufficient 
for the initial selection of men, it is necessary to give a battery of tests 
for effective classification of the men later. 

After discussing the advantages of multiple measurements, it is im- 
portant to consider why test batteries rather than single tests are not 
employed more widely in practical work. One reason is that test batteries 
are more expensive to construct and use than single tests. Another reason 
is that good test batteries require careful preparation and research, and 
it is often the case that hastily constructed batteries fail to work well in 
practice. There is little doubt that well-designed test batteries will even- 
tually come into use in many testing programs. They will pay for 
themselves by improving the guidance, selection, and classification of 
individuals. The know-how is already available to construct better test 
batteries for many uses; and it will come to fruition when administrative 
officials in schools, the Armed Services, industry, and other institutions 
supply the funds and resources which are needed. 
CHAPTER 8


Test Construction 


So far we have talked about how tests are studied after they are
developed, but no attention has been given to test construction. Here
we enter into one of the most complex parts of measurement theory, and 
this introductory text will be able to cover only some of the general prin- 
ciples. The rules for test construction change in terms of the way in 
which tests are used. For example, if the question is asked, “Should a
‘good’ predictor test have items that are homogeneous to one another or
should they be heterogeneous to one another?” the answer is, “Yes!” We
will see how this and other seemingly paradoxical rules work in special 
testing problems. 

The rules for constructing a particular instrument can be clarified at
the outset by deciding whether the test is intended to be a predictor or
an assessment. Although there are desirable features which should be
present in both types of instruments, such as high reliability, there are
other points on which the standards are very different. We will begin
by discussing the construction of predictor tests, then assessments, and
end the chapter with a discussion of some features which they should
have in common.


CONSTRUCTION OF PREDICTOR TESTS 
As was cautioned earlier, before work is undertaken on a predictor 
test, some attention should be given to what it is meant to predict, the 
criterion, or assessment. In particular, studies should be undertaken to 
ensure that the criterion is at least moderately reliable. If not, the test 
constructor can perform some "criterion research," as it was alluded to 
in Chapter 4, to help establish a reliable assessment. 

Job Analysis. Although it was said that a predictor test is initially
formed from a hypothesis (a hypothesis that a particular kind of item
will predict an assessment), the early hunches can usually be improved
by studying the assessment situation. That is, if we are composing tests
to predict success in the operation of particular machinery, we will want
to become acquainted with the job before constructing tests. Among 
other things, we will want to interview men who presently work at the 
job, talk with persons who manage the operation, and make notes about 
the exact duties which the job entails. Among the prominent things to 
look for are the following: 

1. Physical requirements. Does the job require considerable physical
strength? Does a tall or a short person have an advantage in the work? 
(Aircraft companies employed midgets during the Second World War 
to work inside the wings of large airplanes.) Would an older person 
meet with difficulties? 

2. Sensory and motor skills. Does the job require particular kinds of 
muscular coordination? Is acuity of vision, or hearing, or touch vital to 
the job? Is rapidity of movement a necessary attribute? 

3. Intellectual abilities. What is the employee required to know? Is 
it necessary for the employee to study and understand complex instruc- 
tions? What kinds of decisions does the job require? 

4. Personality requirements. Using the word “personality” in the broad 
sense, what type of person gets along well on the job? Is it necessary for 
the employee to supervise other persons? Does the job require courage 
and/or calmness? Does the employee work independently or in coopera- 
tion with other men? 

A job analysis such as this will give some hints as to the kinds of tests 
which are likely to serve as successful predictors. In some situations, it 
is not necessary to go through such formal procedures of job analysis 
before constructing predictor tests. For example, in making up tests to 
predict school success, it is often assumed that the intellectual abilities 
will be the most important determinants of success; and tests appropriate 
to the type of school program can often be constructed without a careful 
scrutiny of the classroom work. However, when a new assessment is to 
be predicted, and the test constructor is not well acquainted with the 
job, a job analysis offers the most logical first step in a testing program. 

Type of Test. After surveying the job requirements and making some 
decisions about the attributes which will be tested, it is necessary to 
decide what form the tests will take. It may be that some of the attri- 
butes will have to be tested on single individuals, whereas some of the 
other tests can be given to a group of men at one time. Some of the 
attributes may require performance tests, and mechanical “gadgets” will 
need to be constructed. Some of the tests may be susceptible to “objec- 
tive” scoring; judges may be required for the rating of other test results. 
Such decisions about the nature of tests are determined in a large 
measure by practical considerations, by the resources available to the 
testing programs and the amount of time which each subject can spend 
in taking tests. That is why paper-and-pencil multiple choice tests are
used more frequently than any other kind. They are the easiest to con- 
struct, administer, and score. 

The Item Pool. After a decision has been made to construct a par- 
ticular type of test, the next step is to compose a large number of test 
items, an item pool. The item is the lowest common denominator of a
test, the individual "thing" which is scored. An item may be the time
taken to assemble a mechanical puzzle, the blood pressure of a subject,
or an arithmetic problem. In each case the item is the indivisible scoring
unit.

The kinds of items which are included in the item pool depend on the
definition of the item universe, which can, at least theoretically, be
thought to underlie particular tests. For example, if it is thought that 
mathematical reasoning problems will predict an assessment well, the 
item pool is filled with mathematical reasoning items. However, there 
are different kinds of mathematical reasoning items, and the test con- 
structor sometimes has difficulty in stating explicitly which kinds will be 
included and which will be ignored. 

The more items which are composed for the item pool, the more room 
there will be to pick and choose among them in constructing a test. As 
a minimum, the item pool must, of course, have as many items as are 
planned for the eventual test. If resources permit, the item pool should 
have three or four times as many items as will be used in the test. It 
helps to have several persons compose items; otherwise the items might 
overemphasize the language and types of problems known to one person. 


The Review of Items. Before items are tried out empirically, the item
pool should be inspected by several sophisticated judges for obvious
flaws. The judges should look for the following things:

1. Unclear statement of problems. It is possible to phrase an item so
poorly that even a person who knows the subject matter thoroughly will
mark the wrong alternative. Most of us are better able to find other
people's awkward expressions than to note our own mistakes. Therefore,
an independent group of judges can usually point out a number of items
which should either be thrown away or should be reworded.

2. Wrong answers. Test constructors are capable of making errors, and
what is stated to be the correct answer for a problem may not be correct
at all. It is not uncommon to find tests some of whose items have no
correct alternatives. The probability of this happening diminishes with
the number of sophisticated individuals who inspect the items.

3. Inappropriate difficulty. An inspection of the items will help weed
out some of those which are either so easy that everyone will get them
correct or so hard that practically no one will get them correct. An
empirical tryout of the items will give more exact information about the
difficulty of items, but some initial inspection of the item pool can weed
out extreme deviants. 

4. Inappropriateness to item universe. Judges should look for items 
which appear to be unrelated to the attribute which the test is intended 
to measure. For example, if a test of mathematical reasoning is being 
constructed, you would not want to have questions which use Latin 
words. It would then become partly a test of the individual's knowledge 
of Latin. You would probably not want to have problems dealing with 
specialized mathematical techniques, such as calculus. The individual's 
ability to answer the question would depend on whether or not he had 
studied the specialized techniques rather than on his ability to reason 
mathematically. In general, the item pool should be kept free of all 
abilities and types of information which are not directly related to the 
attribute to be measured. 

Empirical Tryout of Items. The next step in test construction is to 
administer the whole item pool to groups of subjects. If the item pool is 
large, say over three hundred items, it may be necessary to give part 
of the items to one group and part to another group. If more than one 
group is used, the groups should be homogeneous in respect to ability. 
It is essential that relatively large numbers of individuals try each item. 
If there are fewer than 100 persons, only a limited amount of information 
can be obtained from the tryout. In order for the results of the tryout to 
be acceptably stable, as many as 300 subjects should be obtained, and 
the use of 1,000 subjects is not unusual. The availability of subjects sets 
a limit to the methods of test construction which can be used. If only 
several dozen subjects are available, which is often the case in composing 
tests for particular jobs, an empirical tryout of items will give statistical 
results which are more misleading than helpful. However, an empirical 
tryout with a few subjects is helpful in improving the test instructions 
and administration procedure. 

After the empirical tryout of the item pool, a number of statistical 
procedures can be used to help construct a test. The purpose of item 
analysis is to select from an item pool a minimum number of items 
which will give a maximum prediction of a criterion. 


MULTITEST ITEM ANALYSIS 


In establishing the item pool, it is assumed that the items all measure 
the same thing. However, in spite of the best efforts, it may turn out 
that there are several different attributes involved in the items. For 
example, in a group of mathematical reasoning items, it may be found 
that the questions which require considerable arithmetic computation go 
together as an attribute and that the items which require little com- 
putation and considerable reasoning go together as another attribute. 

Factor Analysis. There are statistical procedures which can be used to 
determine whether or not an item pool contains only one attribute, and 
if not, how many attributes are involved. The most thorough procedure 
to use is factor analysis (to be discussed in the next chapter). This and 
other related procedures deal with the correlations among items. That is, 
just as it is possible to correlate tests with one another, the separate 
items within a test can be correlated with one another. If only a “pass” 
or “fail” score is given to each item, as is usually the case in multiple 
choice tests, special correlational formulas are required (see 8, chap. 10; 
10, chap. 6). 

An attribute, or factor, is defined by a group of items whose members 
correlate highly among themselves and correlate very little with the items 
in other groups. If two items correlate highly, it means that they are 
measuring much the same thing. In very few testing programs is it 
feasible to undertake the labors of intercorrelating all of the items in an 
item pool. If there are 300 items, 44,850 separate correlation coefficients 
would need to be computed. The subsequent work of analysis would be 
even more laborious, making for an amount of calculation which would 
strain even our modern electronic computers. 

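
The count of 44,850 follows from the number of distinct pairs among 300 items, n(n - 1)/2. The short Python sketch below checks that arithmetic and also illustrates one correlation formula suited to pass-fail items, the phi coefficient. The choice of phi and the function names `n_pairs` and `phi` are assumptions made for illustration; the references cited above may recommend other coefficients.

```python
from math import comb, sqrt


def n_pairs(n_items):
    """Number of distinct item pairs, n(n - 1) / 2."""
    return comb(n_items, 2)


def phi(x, y):
    """Phi coefficient between two pass-fail items scored 1 (pass) or 0 (fail).

    x and y hold the scores of the same persons on two items.
    """
    n = len(x)
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # pass both
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # pass first only
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # pass second only
    d = n - a - b - c                                      # fail both
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        # An item that everyone passes or everyone fails has no variance,
        # so no correlation can be computed; 0.0 is returned by convention.
        return 0.0
    return (a * d - b * c) / denom


print(n_pairs(300))  # 44850 coefficients for the whole pool
print(n_pairs(50))   # 1225 coefficients for a subsample of 50 items
```

Restricting the analysis to 50 items reduces the number of coefficients from 44,850 to 1,225, which is the saving that motivates the approximate procedures.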
Approximate Procedures. One approximate procedure is to factor ana- 
lyze only a part of the total item pool. It is within practicality to factor 
analyze, say, a group of 50 instead of 300 items. The items which are 
factor analyzed can either be selected randomly from the total item 
pool; or, if it is suspected that certain groupings of the items will occur, 
a number of items can be chosen from each of the suspected groups. A 
factor analysis will then show how many factors are involved in the 
subsample of items from the pool. Then, by correlating all the items 
which were not in the factor analysis with each of the factors, the 
groupings for the total item pool can be determined. There are numerous 
other such approximate procedures which are used to decide how many 
tests should be constructed from an item pool (see 14). 

Difficulty Levels. If the multitest approach is being used and statistical 
analyses have been undertaken to specify the item groupings, one more 
type of information is required before composing tests to measure the 
factors. It is necessary to determine the difficulty level for each item, 
which is the per cent of persons who get an item correct. You will 
remember that the purpose of item analysis is to maximally predict a 
criterion. In Chapter 5 the effect of the standard deviation on the 
correlation coefficient was described: the larger the dispersion, the higher 
the correlation in general. The difficulty levels of the items in a test 
determine, in part, the standard deviation of scores on the test. Some 
illustrations will help to show how this works. If the items are so easy 
that everyone gets all of the items “correct” (the difficulty level for each 
item is 100 per cent), everyone will obtain the same score, and the 
standard deviation will be zero. The standard deviation will also be zero 
if the items are so hard that no one gets any of the items correct. 

If a multiple choice test is being used, guessing will probably influence 
the item difficulty levels. A certain per cent of the subjects will get an 
item correct by guessing even if no one really knows the answer. If there 
are n alternatives for an item, 1/n of the subjects would probably get the 
item correct by guessing alone. Consequently, it is not expected to find 
items with difficulty levels below 1/n. The minimum expected difficulty 
for five alternatives is 20 per cent, for four alternatives 25 per cent, and 
so on.¹ 

The number of alternatives used for each item usually restricts the 
range of possible scores. With five alternatives, the range is from 20 per 
cent correct answers to 100 per cent correct answers. Then in a 40-item 
test the scores could range from about 8 to 40. The problem is to choose 
items in terms of difficulty levels in such a way as to spread people out 
widely in the possible score range. The rule is to pick a set of items whose 
average difficulty level is near the middle of the possible score range. 
That is, in developing a test with five alternatives for each item, items 
should be chosen in such a manner that their average difficulty is about 
60 per cent. The average difficulty is found by summing the separate 
difficulty levels and dividing by the number of items. It is not important 
that the mean difficulty be exactly at the middle of the possible score 
range, but the standard deviation of test scores will be lowered if the 
mean difficulty is near either the bottom or top of the possible score 
range. 
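
The arithmetic of the last few paragraphs can be sketched in a few lines of Python. The sketch assumes, as the text does, that 1/n of the subjects pass an n-alternative item by guessing alone; the function names are hypothetical and chosen only for this illustration.

```python
def chance_floor(n_alternatives):
    """Minimum expected difficulty level: the proportion who pass by guessing."""
    return 1.0 / n_alternatives


def target_mean_difficulty(n_alternatives):
    """Middle of the possible range, halfway between the chance floor and 1.0."""
    return (chance_floor(n_alternatives) + 1.0) / 2.0


def score_range(n_items, n_alternatives):
    """Approximate lowest and highest expected scores on the test."""
    return (n_items * chance_floor(n_alternatives), float(n_items))


def mean_difficulty(difficulties):
    """Average difficulty: sum of the item difficulty levels over their number."""
    return sum(difficulties) / len(difficulties)


for n in (5, 4):
    print(n, "alternatives:",
          "floor", format(chance_floor(n), ".0%"),
          "target mean", format(target_mean_difficulty(n), ".1%"))

low, high = score_range(40, 5)
print("40-item, five-alternative test: scores from about",
      round(low), "to", round(high))
```

For five alternatives the floor is 20 per cent and the midpoint is 60 per cent, the figures given in the text; for four alternatives the same reasoning puts the floor at 25 per cent and the midpoint at 62.5 per cent.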

In addition to the decision to choose items that have a mean difficulty 
near the middle of the possible score range, there is a question as to the 
distribution of difficulties that should be sought. Should all of the items 
have difficulties close to the middle of the score range, should half of 
the items be very easy ones and the others very hard items, or should 
a particular distribution of difficulties be sought? The ideal distribution 
of difficulties varies in terms of the use which will be made of the test 
and the intercorrelations of the items. Consequently, no "airtight" rule 
can be given for all situations. 

After an empirical tryout of an item pool, a good general procedure 
is to choose approximately an equal number of items at each difficulty 
level, for example, 5 with difficulties between 20 and 30 per cent, 5 items 
[...] 

¹ Although this rule is generally useful, there are special circumstances in which it 
is inappropriate. This is particularly so when the item alternatives are carefully 
designed to contain “misleads.” 

[...] number of items at particular difficulty levels, there [...] as to 
which of the items should be included in the test. If the [...] been 
previously studied by factor analysis, those items [...] A test should be 
constructed [...], or at least for the ones [...] the separate tests can be 
[...] prediction of a particular criterion. 

UNITEST ITEM ANALYSIS 

It may be decided in a testing program that only one test can be used 
or, in other words, that it will be permissible to derive only one test from 
an item pool. This decision may be determined in part by principles of 
economy, because it is usually more expensive to derive and use more 
than one test. However, there is some confusion as to what constitutes 
one test. If the scores on 60 items are added up to obtain one score, it 
is said that one test is in use; but if the items are divided into three 
groups of 20 items each, and each of the three groups of items is scored 
separately, there are as many tests being used as there are separate 
scores for each [...] item pool with the assumption that [...] one test 
will be developed, because, as was said in the previous section, 
statistical analyses may indicate that a number of factors underlie [...] 
analysis needed to determine the factors in an item pool are not feasible 
in many testing situations. 

It must be kept in mind that the rule is most applicable [...] which 
[...] differentiations at all points [...] In certain special cases a test 
is used only to distinguish certain “upper” and “lower” groups, for 
example to pick out the top [...] difficulties would be massed near the 
cutting point rather than near the middle of the possible score range. 
That is, most of the items would be chosen in the 60 to 90 per cent 
difficulty range. Other procedures of item selection are required for 
special purposes. 

Multiple Regression. If the whole item pool has been given an empir- 
ical tryout, there is an ideal way to combine the items into a test: namely, 
to combine all of the items by multiple-regression formulas to predict 
the criterion. The analysis would first require all of the intercorrelations 
among the items, plus the correlation of each separate item with the 
criterion. Then it would be necessary to run a complete multiple-regres- 
sion analysis, obtaining a beta weight for each item. If this ideal method 
of analysis were undertaken, it would give the best “least squares” pre- 
diction of the criterion and would even do a better job than the multitest 
approach. Unfortunately, such an analysis is totally impractical. Multiple- 
regression weights are difficult enough to compute with five or six vari- 
ables; it is out of the question to perform such an analysis for, say, the 
300 items in an item pool. Not being able to perform this ideal type of 
analysis, some approximate approaches are necessary. 

Item Difficulties. What was said about item difficulties in respect to 
the multitest approach holds as well for the unitest approach. The more 
variance in the total scores the more the test is likely to correlate with 
the criterion. Therefore, a set of items should be chosen whose mean 
difficulty is at about the middle of the possible score range. The items 
should also be scattered along the range of difficulty in general resem- 
blance to the scheme which was previously described. 

Item-criterion Correlations. If it is possible to perform the labors of 
correlating each item in the item pool with the criterion, this information 
can be combined with a knowledge of difficulty levels to make a more 
predictive test. The test score is usually obtained as a sum of the item 
scores, which is the number of correct responses. If the individual items 
do not correlate with the criterion, the total score will not correlate with 
the criterion either. 

A test will be more predictive if the items which correlate well with 
the criterion are chosen. The rule is that from the items at each difficulty 
level select those which correlate highest with the criterion. That is, if 
the item pool has 20 items with “difficulties” between 30 and 40 per cent, 
and the test will only be long enough to use five items in that category, 
the five items should be chosen which have the highest correlations with 
the criterion. 
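
The selection rule just stated can be sketched as a small Python routine. It is only an illustration under stated assumptions: difficulty levels are expressed in per cent, the difficulty categories are taken to be 10 points wide, and the name `select_items` and the example data are hypothetical.

```python
def select_items(items, per_band, band_width=10):
    """Pick, within each difficulty band, the items correlating highest
    with the criterion.

    items: list of (item_id, difficulty_percent, item_criterion_r) tuples.
    per_band: number of items the test has room for in each band.
    band_width: width of a difficulty band in percentage points (assumed 10).
    """
    bands = {}
    for item_id, difficulty, r in items:
        band = int(difficulty // band_width)  # e.g. 30 to 39 per cent is band 3
        bands.setdefault(band, []).append((item_id, difficulty, r))

    chosen = []
    for band in sorted(bands):
        # Rank the items of this band by item-criterion correlation, best first.
        ranked = sorted(bands[band], key=lambda item: item[2], reverse=True)
        chosen.extend(item_id for item_id, _, _ in ranked[:per_band])
    return chosen


# Hypothetical tryout results: (item, per cent correct, correlation with criterion)
pool = [
    ("a", 32, 0.10), ("b", 35, 0.40), ("c", 38, 0.25),
    ("d", 55, 0.30), ("e", 52, 0.05),
]
print(select_items(pool, per_band=1))  # ['b', 'd']
```

As the text cautions later, items picked this way look better on the tryout group than they will on new subjects, so the resulting test still needs a second tryout.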

Interitem Correlations. If it is possible to obtain not only the difficulty 
levels and the item-criterion correlations, but the intercorrelations of all 
of the items in the item pool as well, this information can be used to 
derive a more predictive test. It was said that multiple regression is the 
ideal method of unitest analysis. You will remember from Chapter 7 that 
the multiple correlation is highest when the predictor variables (items 
in this case) correlate high with the criterion and low with one another. 
Therefore, items should be chosen which not only correlate as much as 
possible with the criterion and conform to the requirements for difficulty 
level but which correlate as little as possible with one another. There 
are a number of item-analysis procedures which take into consideration 
both item-criterion correlations and interitem correlations (see 3, 6, 7, 10, 
12). 
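
The point about interitem correlations can be checked numerically for the simplest case of two items. The sketch below uses the standard formula for the multiple correlation with two predictors, R² = (r1² + r2² - 2·r1·r2·r12) / (1 - r12²); the function name and the example values are assumptions made only for illustration, and the three correlations are assumed to be mutually consistent.

```python
from math import sqrt


def multiple_r_two_predictors(r1c, r2c, r12):
    """Multiple correlation of a criterion with two predictors.

    r1c, r2c: correlations of items 1 and 2 with the criterion.
    r12: correlation between the two items.
    """
    if abs(r12) >= 1.0:
        raise ValueError("the two predictors must not be perfectly correlated")
    r_squared = (r1c ** 2 + r2c ** 2 - 2 * r1c * r2c * r12) / (1 - r12 ** 2)
    return sqrt(r_squared)


# Two items, each correlating .30 with the criterion.
uncorrelated = multiple_r_two_predictors(0.30, 0.30, 0.0)
overlapping = multiple_r_two_predictors(0.30, 0.30, 0.8)
print(round(uncorrelated, 3), round(overlapping, 3))
```

With uncorrelated items the multiple correlation is about .42; when the same two items correlate .80 with each other it falls to about .32, scarcely better than either item alone.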

In the multitest method, items are grouped, or placed in factors, be- 
cause they correlate highly with one another. In the unitest method, items 
should correlate low with one another. In other words, the items within 
any one of the tests derived by the multitest approach should be homo- 
geneous in respect to one another, and the items within a test obtained 
by the unitest approach should be heterogeneous in respect to one an- 
other. Consequently, the apparent paradox introduced earlier in the 
chapter is not a paradox after all: different rules apply in accordance 
with the item-analysis strategy being used. 



DECISIONS ABOUT THE CONSTRUCTION OF PREDICTOR TESTS 


Having seen some of the formal procedures of analysis which can be 
used, we should next ask, “What do people usually do in constructing 
tests?” To begin with, tests are often constructed without any kind of 
item analysis. People are often able to “dream up” a set of test items, 
which, when tried out, conforms fairly well to the statistical requirements. 
The author often requires his students to compose tests for the prediction 
of school grades. The students have neither the time nor the resources 
to compose large item pools, undertake empirical tryouts, or perform 
elaborate statistical analyses. It is surprising to find how many of these 
tests actually predict the criterion with moderate success and also have 
distributions of item difficulties which are acceptable. 

It is not recommended that formal procedures of test construction be 
ignored, but it is possible, and often happens, that an [...] com- 
posed test will give satisfactory prediction. However, [...] 
procedures are not dependent on the success of the [...], 
and the item-analysis methods will almost always lead to a 
higher [...] correlation. 

If resources are available for performing item analyses, then a decision 
must be made as to what procedures should be used. A failure to dis- 
tinguish between the multitest and unitest approaches has caused some 
confusion about proper methods of analysis. If the additional labors 
required by the multitest method can be undertaken, that will usually 
give the most satisfactory results. 

In addition to the likelihood of obtaining a higher prediction of the 
criterion from the multitest approach, the results have a generality 
beyond the immediate testing problem. The unitest method tends to 
produce a conglomeration of items which is directly related to only one 
assessment, or criterion. The multitest method divides items into homo- 
geneous groups, or factors, which can be validated against numerous 
criteria. Factor analysis is often used when there is no particular cri- 
terion to be predicted but where the effort is to derive general attributes 
which can, in the course of events, be used to predict many different 
criteria. 

Taking Advantage of Chance. There is an important caution to be 
noted in the interpretation of the results of any item analysis which 
selects items in terms of the item-criterion correlations (which is done 
with some of the unitest methods but not with the multitest methods). 
If there are many items in the item pool, some of them will correlate 
highly with the criterion if only by chance. The more individuals used 
in the empirical tryout, the smaller will be the sampling error of correla- 
tion coefficients. As an extreme example, if the number of individuals is 
no larger than the number of items, it is always possible to find a perfect 
multiple correlation between the items and any other variable. You can 
name any other variable that you like, the weights of the subjects, for 
example, and the perfect multiple correlation can be found. However, 
the apparent validity would be only a spurious capitalization on sam- 
pling errors. 

When we select only the few items which correlate highly with the 
criterion, this “takes advantage of chance.” A second empirical tryout 
will show that the selected items correlate lower with the criterion than 
was indicated in the first tryout. Consequently, a test derived from the use 
of item-criterion correlations will work less well in subsequent use than 
it appears to work on the original tryout group. After obtaining a test 
from item-criterion correlations, the test scores can be correlated with 
the criterion scores. We refer to this type of correlation as re-correlation, 
because the measure of predictive validity is obtained on the same sample 
of scores used to develop the test. Re-correlations are almost always 
spuriously large. There have been numerous instances in which a derived 
test has been re-correlated with the criterion to produce a seemingly 
handsome validity, only to find in subsequent research that the test has 
no predictive validity at all. 

Cross validation. Re-correlations are so misleading that it is better not 
to compute them. After the item analysis of the first tryout group, the 
test obtained should be administered to a second group of individuals. 
The correlation between the scores made by the second group and the 
criterion is the predictive validity of the test. The use of a second group 
in this way is referred to as cross validation. 


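The contrast between re-correlation and cross validation can be demonstrated with a small simulation. The Python sketch below is an illustration under artificial assumptions, not a procedure from the text: every item is answered at random, so no item has any real relation to the criterion, and the sample sizes, the seed, and the function names are all hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variance, so no correlation can be computed
    return sxy / sqrt(sxx * syy)


def simulate_sample(rng, n_subjects, n_items):
    """Random pass-fail items and a criterion that is unrelated to them."""
    items = [[rng.randint(0, 1) for _ in range(n_subjects)]
             for _ in range(n_items)]
    criterion = [rng.gauss(0, 1) for _ in range(n_subjects)]
    return items, criterion


def total_scores(items, chosen):
    """Number-correct score of each subject on the chosen items."""
    n_subjects = len(items[0])
    return [sum(items[i][s] for i in chosen) for s in range(n_subjects)]


def demo(seed=1, n_subjects=50, n_items=200, n_keep=10):
    rng = random.Random(seed)
    # First tryout: keep the items that happen to correlate highest.
    items, criterion = simulate_sample(rng, n_subjects, n_items)
    ranked = sorted(range(n_items),
                    key=lambda i: pearson(items[i], criterion),
                    reverse=True)
    chosen = ranked[:n_keep]
    re_correlation = pearson(total_scores(items, chosen), criterion)
    # Second group: the same item positions, entirely new subjects.
    new_items, new_criterion = simulate_sample(rng, n_subjects, n_items)
    cross_validity = pearson(total_scores(new_items, chosen), new_criterion)
    return re_correlation, cross_validity


re_corr, cross = demo()
print("re-correlation:", round(re_corr, 2))
print("cross validation:", round(cross, 2))
```

Because the items are pure noise, whatever validity the re-correlation shows is produced entirely by selection on chance; the second group gives the honest estimate, which should fall near zero.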
CONSTRUCTION OF ASSESSMENTS 


The major concern in this section will be with the construction of class- 
room examinations. Many of the varied assessments which are made of 
human performance are not, properly speaking, tests at all. The promo- 
tion of a naval officer, the hiring of a chef, the elevation of a business 
executive to a higher position may entail none of the formal, standardized 
procedures which we call tests. However, many of the requirements of 
a good assessment test apply to assessments in general. 

The Importance of Tests. Whenever students receive grades for their 
work, tests have a very large influence on human lives. The grade which 
a student obtains may determine whether or not he graduates from high 
school, is given a better position in civil service, or receives a commission 
in the Armed Forces. The teacher who takes his work seriously finds the 
fair allotment of grades to be the most arduous, and sometimes painful, 
part of his job. Every teacher has seen the time when a student’s whole 
life had to be rearranged because of a failing grade. 

Tests not only share in the responsibility of allotting grades, they also 
determine in large measure what is learned. Students are very quick to 
detect what the teacher emphasizes in examinations, and it is only human 
for the student to work hardest on the things which are likely to appear 
on the test. Tests can also affect what the teacher teaches. If standard 
achievement tests are used, as they are in many school systems, the 
teacher is often under pressure to have the students do as well as 
possible. He is then likely to slant his instruction toward the things 
emphasized on the achievement test, even to the point of nearly re- 
hearsing the actual examination. 

Because of the importance of classroom examinations, they merit very 
careful preparation. It is unfortunate that the work load imposed on 
many teachers prevents their careful attention to the construction and 
grading of each test. A classroom examination is “good” if it is both 
reliable and valid. Anything that can be done to raise the reliability of 
the examination along the lines discussed in Chapter 6 will make for a 
better test. Validity of a classroom examination means content repre- 
sentativeness, in other words, that a wide sample of important questions 
has been employed. 

Educational Goals. What constitutes important materials can be deter- 
mined only by the teacher's careful attention to the aims of the course. 
Part of this decision should be determined by the future courses which 
the student will take. The first course in calculus must prepare the 
students for the second course in calculus. Although there is some latitude 
as to how it is taught, there are certain fundamentals to be covered. 
Outside the common core of knowledge that is expected to be imparted 
in a particular course, the instructor adds his own particular flavor to 
the subject he teaches. The student “takes the instructor” as well as 
taking the course. Each teacher has his own pet theories, anecdotes, and 
his strong and weak points in imparting the course material. To an 
extent this is very good, because much of education is concerned with 
learning how other people think. Whatever special directions are given to 
the subject matter, the instructor should be aware of what he is trying 
to teach and make course examinations which will reflect his goals. 

Outline of Content. Most instructors find it helpful to outline the goals 
of a course, to write down the important principles that they want to 
convey. Such an outline is not quite the same as the usual course outline, 
the latter being more concerned with the order of presenting topics than 
with the goals of instruction. In this textbook, for example, the goal is 
to convey several dozen general principles, some of which concern: 

1. The logical foundations of psychological measures 
2. The distinction between “processes” and individual differences 
3. The characteristics of assessments and predictors 
4. The nature of statistical prediction 
5. The influence of measurement error on predictions 

The author bases his own course examinations on a longer list of such 
issues. These are the things which he thinks will help the student in 
taking future courses, provide the widest background for using tests, 
and generally make him more erudite in the subject matter. 

The outline should be spelled out in greater detail than the several 
points listed above, including subheadings under each topic concerning 
the more particular goals in each section of instruction. Although most 
instructors agree, at least in principle, that the general points of view 
are the more important things to convey, the outline will include a knowl- 
edge of particular techniques, facts, and even some names and dates. The 
instructor must decide which details are important for students to learn 
and, consequently, which of these will find their way into examina- 
tions. 

A carefully done outline of course goals will not only prove useful 
in constructing examinations, it will help the instructor in teaching his 
course. He may note that he has been underemphasizing important points 
in the outline, or that relatively unimportant topics have occupied too 
much of the course time. After the outline is constructed, and some 
changes in it would be expected with the instructor's continued experi- 
ence with the subject matter, the problem is to translate the outline into a 
successful examination. At this point we depart from the skills of class- 
room instruction and move into the specialty of psychological measure- 
ment. The instructor who is wise in the ways of teaching and who may 
outline excellently the class goals may fail miserably in translating the 
goals into a creditable examination. Similarly, the measurement specialist 
is often not qualified to make decisions about the importance of various 
aspects of particular subject matters. A happy blending of skills occurs 
when the teacher has a firm understanding of the basic principles of 
testing. 


ESSAY VERSUS MULTIPLE CHOICE EXAMINATIONS 


As with predictor tests, there is, at least conceivably, a wide choice of 
forms for a particular assessment, including performance tests, work 
samples, ratings, and many others. The attention here will be given to 
paper-and-pencil forms, by far the most popular type of classroom 
examination. 

In the essay examination, the student is presented with a few ques- 
tions or problems and asked to supply the answers. In the multiple choice 
form the student is presented with a large number of questions, with a 
limited number of alternative answers for each. Considering the wide use 
of both of these forms and the amount of controversy which has been 
generated over which is "better," their differences and special advantages 
will be discussed in some detail. 

Convenience. The essay examination is comparatively easy to con- 
struct but difficult to grade with more than a dozen students. An ade- 
quate multiple choice examination is much more difficult to construct, 
but it can be graded easily even with hundreds of students. There are 
no forms to be reproduced with the essay examination and no special 
materials are needed. The questions can be written on the blackboard 
for all students to see. It is usually necessary to have the multiple choice 
examination typed and mechanically reproduced, both of which may be 
a strain on school facilities. In addition, the multiple choice test may 
require special answer sheets and scoring forms. In general, the more 
students to be tested either at one time or in the eventual use of the test, 
the more practical convenience is obtained from the multiple choice form. 

Reliability. The multiple choice test is usually more reliable, and sub- 
stantially more so, than the essay test. The unreliability of the essay 
examination comes from two main sources: from test scoring and from 
the sampling of content. Teachers often disagree about the grade for 
particular papers, and individual teachers often give considerably dif- 
ferent marks to the same papers which they regrade after a period of 
time. Such inconsistencies are dependent on the teacher, with some in- 
structors being much less reliable in their grading. There is more error 
due to the sampling of content in essay examinations because of the 
small number of questions which are used. A student’s grade can depend 
considerably on his “luck” in having understood the particular materials 
on the test. The many items which can be used on a multiple choice test 
provide an opportunity to spread questions widely over the subject 
matter. 

Representing the Important Content. It is in this category that the 
strongest arguments have been given for the use of the essay examination. 
It has been pointed out by many persons that the multiple choice exam- 
ination often emphasizes the simple facts of a subject matter while pro- 
viding no evidence about the student's knowledge of more general issues. 
The multiple choice test has been criticized as being only a measure of 
memory rather than a measure of understanding. Another argument is 
that the multiple choice examination reflects none of the student's 
creative ability. 

Proponents of multiple choice examinations point out that regardless 
of how valid essay examinations might be in theory, the unreliability of 
essay examinations keeps them at a relatively low level of validity in 
practice. The criticisms of multiple choice examinations usually concern 
how bad they can be when improperly constructed and ignore the 
extent to which important materials can be framed in multiple choice 
form by the ingenious test constructor. There is a considerable amount 
of research evidence which indicates that well-constructed multiple 
choice examinations can effectively measure many different kinds of 
course content, even with materials that would not seem appropriate. For 
example, Tyler (13) found a correlation of .85 between the ability of 
students to draw generalizations from data and a multiple choice test, 
the alternatives being better and worse generalizations which could be 
attributed to the information. Even in the testing of French pronunci- 
ation, high correlations were found between samples of speech and a 
multiple choice test (9). Other studies have shown that multiple choice 
tests can effectively measure some of the skills in writing and mathe- 
matics (2, p. 78). 

Rather than try to decide whether multiple choice examinations are 
generally better than tests of the essay type, or vice versa, it would be 
more appropriate to see how they can both be made as effective as 
possible and how they can be used in conjunction with one another. 
When the course material is mainly factual and technical, or the con- 
cepts to be taught are relatively simple, there is a definite advantage for 
the multiple choice test. Multiple choice tests are generally more appli- 
cable at lower than at higher levels of education. Multiple choice tests 
can be justified in elementary school, and in many of the courses in high 
school and college. Some of the courses in high school and college which 
require creative talents and the ability to organize ideas have a more 
definite need for essay type examinations. In most graduate studies, it is 
very difficult to place the whole burden of determining grades on 
multiple choice tests alone. 


COMPOSING ESSAY EXAMINATIONS 


The Wording of Questions. Essay questions should range as broadly
as possible across the course content. Content representativeness can be
increased by having a larger number of questions for which the student
is expected to write only a paragraph or two. Very general questions
should be avoided, such as “How did Hitler come to power?” or “What
happened to art in the fifteenth century?” It directs the student more
accurately to a particular topic if the essay question spells out the points
which are to be discussed, an example being as follows:

Discuss the use of the standard error of estimate with tests under the
following headings:

1. The statistical derivation

2. The meaning of the “per cent of variance explained”

3. The effect on the standard error of estimate of changes in the
dispersion of assessment scores

4. The assumption of homoscedasticity

Choice of Questions. A widely used procedure with essay examinations
is to let the student choose a few from a larger number of questions.
Although this is a popular procedure with students, it has definite
disadvantages. Because the questions on a test are intended to be a sample
of the course content, allowing the student to choose among the questions
reduces the value of the test as a sample. Also, this procedure makes it
difficult to compare students with one another. It is possible that two
students would write on entirely different questions, providing no basis
of comparison between them.

Grading. There are several steps which can be taken to remove the
largest flaw in essay examinations, the unreliability of grading. The
primary way to increase the reliability is to use more than one grader.
The student should then be given the average grade. The use of two
graders will substantially raise the reliability, and if it is feasible, as
many as five instructors can collaborate to make the grades even more
reliable. Multiple gradings are usually employed with important graduate
school examinations.
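
The gain to be expected from additional graders can be sketched with the
Spearman-Brown formula, on the assumption that the graders behave like
parallel measures of the same quality. The single-grader reliability of .50
is a hypothetical value chosen only for illustration, and the function name
is an assumption of this sketch.

```python
def averaged_reliability(single_grader_r, n_graders):
    """Spearman-Brown estimate of the reliability of the average of n graders.

    Assumes the graders are interchangeable (parallel) measures.
    """
    r, k = single_grader_r, n_graders
    return k * r / (1 + (k - 1) * r)


# Hypothetical single-grader reliability of .50
for k in (1, 2, 5):
    print(k, round(averaged_reliability(0.50, k), 2))
```

Under these assumptions the estimate rises from .50 for one grader to about
.67 for two and about .83 for five, which is in line with the advice to
average several independent grades.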

If several persons grade the same examinations, each grader should
have no knowledge of the grades given by the other graders. In order
to discount the personal likes and dislikes of the graders, the students’
names should not be placed on the tests. Before giving grades, it helps
the instructor to first read through all, or a number, of the tests. A better
basis of comparison can be had by grading one question through all of
the examinations and then returning to the second question. It is better
to list the grades for each question on a separate sheet. If grades are
placed on the tests, the first grades which are given might prejudice
the instructor toward later questions.

It is very difficult for the instructor to avoid certain biases in the grad- 
ing of essay examinations. The student who writes well in general has 
an advantage on most essay questions. If there are too many questions 
set for the allotted time period, an advantage is given to the person who 
can express himself rapidly in writing. The student with poor penman- 
ship and spelling is often penalized on examinations that have little to 
do with basic writing skills. Unless the course material specifically con- 
cerns writing skills, the instructor should grade in terms of the content, 
not in terms of the elegance of presentation. 


THE CONSTRUCTION OF MULTIPLE CHOICE EXAMINATIONS 


The Item Pool. A collection of items is needed for the classroom test 
as well as for the construction of a predictor test. It is also helpful to 
have many more items than will be used in the test. The principles used 
in the selection of the “best” items for the classroom test are different 
from those used to compose a predictor test. In composing a predictor 
test, items are selected in such a way as to maximize the prediction of 
a criterion. In constructing an assessment test, items are selected in such
a way as to maximally represent the performance or, in the case of a
classroom examination, the course content. 

The outline of content serves as a guide to the collection of items. One 
way to proceed is to make up a number of items for each of the subheadings
in the outline. Instructors usually find it easier to collect items
over a period of time to fit the various parts of the outline, rather than 
compose them all at once. The author establishes an item pool for a 
course in tests and measurements by writing down items as they occur 
to him, or after an important issue has been raised in classroom discussion.
Over a period of time some hundreds of items have been obtained
in this way, with different item pools available for each of the
several tests used during the semester.

Choice of Items. Most instructors agree that the general points of view 
and the ability to use course content are more important than segregated 
facts and memory for specific details. Although some simple factual items 
should appear in most tests, many of the items should concern the 
student's ability to understand issues and reason from one situation to 
another. Items of that kind are difficult to construct, but carefully com- 
posed multiple choice items can measure the student's understanding of 
complex issues.

The Wording of Questions. Careful attention should be given to the
wording of both predictor and assessment test items. There is generally 
more difficulty in the wording of classroom examinations because the 
questions are usually more complex. The inexperienced instructor often 
makes some glaring mistakes in the wording of multiple choice items. 
There is a considerable body of information on the writing of test items 
(see 1, 4, 11). Some of the most important rules will be briefly described. 

1. Make all of the alternatives grammatically consistent with the stem
(the “question”). For example, with a stem like “The piston ring is,”
you would not use a plural noun for an alternative, such as “three small
gears.” Or if the stem ends with the article “a,” you should not include
alternatives which begin with vowels. There are many other ways in
which grammatical inconsistencies can “give away” an item.

2. Do not make the correct alternatives consistently longer or shorter
than the incorrect alternatives. In the same vein, do not overspecify the
correct alternative, such as using remarks in parentheses and extra detail
which will give away the answer. A typical mistake is to be rather casual
in the formulation of incorrect alternatives and to highly specify the
correct alternatives.

3. Make the incorrect alternatives plausible. Sometimes an individual
gets the correct answer because the other alternatives are obviously not
related to the subject matter. The rule is that all of the alternatives
should be equally appealing to the person who does not know the answer,
but only one alternative should make sense to the person who knows
the answer. Illustrating this rule with the question, “Which one of the
following is one of the psychophysical methods?” we would not employ
an incorrect alternative such as “calculus,” but a plausible, though
incorrect, alternative such as “triadic inconsistencies.”

4. Avoid alternatives like “none of the above.” It is difficult to frame
items in such a way that one of the alternatives is absolutely true in all
circumstances. The better logic is to construct alternatives in such a way
that one of them is markedly more correct than the others. The use of
“none of the above” places a double and somewhat confusing burden
on the student: he not only must detect the alternative which is most
correct, he must decide whether or not it is absolutely correct.

5. Use a random order for the alternatives in each item. This can be
done with a table of random numbers to be found in many books on
statistics (see 5). Another approach is to write the letters for the positions
on separate file cards (if there are five alternatives, the letter a would
be written on one card, the letter b on another, and so on to e); the cards
can then be shuffled and dealt out to allot alternatives to their positions
beneath the question. Some instructors arrange the alternatives
systematically throughout the test, insuring that the correct alternative
appears an equal number of times at each position. Even this method can
give away answers. Randomization of alternatives is the safest procedure.
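
The card-shuffling procedure of rule 5 can be sketched in a few lines of
Python. This is only an illustration: the function name and the placeholder
alternatives are hypothetical, and the sketch assumes that the wording of
each alternative within an item is distinct.

```python
import random


def shuffle_alternatives(correct, distractors, rng=random):
    """Return lettered alternatives in random order plus the keyed letter."""
    options = [correct] + list(distractors)
    rng.shuffle(options)                      # same idea as shuffling file cards
    letters = "abcdefghij"[:len(options)]
    lettered = list(zip(letters, options))
    key = letters[options.index(correct)]     # position where the correct answer landed
    return lettered, key


# Placeholder item with one correct and four incorrect alternatives
lettered, key = shuffle_alternatives(
    "correct answer",
    ["distractor 1", "distractor 2", "distractor 3", "distractor 4"],
    rng=random.Random(),
)
for letter, text in lettered:
    print(f"{letter}. {text}")
print("key:", key)
```

Recording the keyed letter at the moment of shuffling spares the instructor
from reconstructing the scoring key afterward.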

Review of Items. As in the construction of predictor tests, it is helpful 
to have several persons, or at least one person, review the items which 
go into a classroom examination. This may be too laborious for the whole 
item pool, but someone should at least read and criticize the smaller 
number of items which are used in the test. All of the points mentioned 
earlier in respect to the review of predictor test items apply to assessment 
tests. 

Selection of Items for the Test. Whereas elaborate methods of item 
analysis were recommended for the selection of items for a predictor 
test, the item pool need only be sampled to obtain the items for an assess- 
ment test. If the item pool is representative of the important course con- 
tent, and this rests on the instructor’s judgment, a sample of the items 
will serve as an adequate test. The author selects the items for his class- 
room tests by drawing 40 items at random from an item pool. Any ran- 
dom sample of forty will give much the same results that would be 
obtained from any other random sample and much the same results as 
if the whole pool were made into a test. The sample of items should be 
inspected to make sure that they have all been covered in the course 
work. 

Some people feel uneasy in randomly selecting items from the item 
pool, because they think this method is loose and inexact. But randomness 
is desirable both from the standpoint of sampling logic and to prevent 
the test from being slanted toward any particular feature of the content. 
If the class is told in advance that the items will be randomly drawn 
from a larger pool, they are more content to apply themselves to the 
whole subject matter rather than to try to "dope out" what the test will 
be about. Also, because the instructor will not want to use the same test 
over and over with new classes, a new test can be constructed by a 
second random drawing of items.

It is sometimes useful to divide the items in a pool into several cate- 
gories and randomly select a number of items from each. This can be 
done if there are definite divisions of the subject matter. However, the 
score obtained on a completely random selection of items will be very 
much the same as obtained by using categories. 
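
Both ways of drawing a test from the pool can be sketched as follows. The
pool of 200 numbered items and the four content categories are hypothetical
stand-ins; only the drawing of 40 items at random comes from the text.

```python
import random


def draw_items(pool, n_items, rng):
    """Draw n_items at random, without replacement, from the whole pool."""
    return rng.sample(pool, n_items)


def draw_by_category(pool_by_category, n_per_category, rng):
    """Draw the same number of items at random from each content category."""
    test = []
    for items in pool_by_category.values():
        test.extend(rng.sample(items, n_per_category))
    return test


rng = random.Random()
pool = [f"item {i}" for i in range(1, 201)]        # hypothetical pool of 200 items
test_form = draw_items(pool, 40, rng)

by_category = {                                     # hypothetical content divisions
    "reliability": pool[:50],
    "validity": pool[50:100],
    "norms": pool[100:150],
    "item writing": pool[150:],
}
stratified_form = draw_by_category(by_category, 10, rng)
print(len(test_form), len(stratified_form))
```

A second call to either function yields a fresh form for the next class,
which is the practical advantage mentioned above.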

Item Analysis. Some information can be obtained about the effective- 
ness of an assessment test by an examination of the scores obtained by 
students. The classroom instructor seldom has the facilities to administer 
the whole item pool to a group of students and must rely, instead, on 
the results from the items in one test. There are several desirable features 
which should appear in multiple choice examinations. One of these is 
that the frequencies of choices for the incorrect alternatives should be 
approximately equal. If practically no one chooses an alternative, it is
a “give away” and should be replaced with a more plausible incorrect 
alternative. If one of the incorrect alternatives is chosen more frequently 
than the designated “correct” alternative, the instructor may be wrong 
himself, or the question may be misleading. Even on a classroom exam- 
ination there is little reason for retaining an item that no one, or almost 
no one, gets correct. It is more reasonable to retain a few very easy items 
as a way of giving the average student a feeling of accomplishment. 
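
The checks just described can be sketched as a small tally of the choices
made on one item. The cutoffs of 5 per cent for a "rarely chosen"
alternative and 10 per cent for an item that "almost no one" gets correct
are illustrative assumptions, not values given in the text, and the
response data are hypothetical.

```python
from collections import Counter


def analyze_item(responses, key, alternatives="abcde", rare=0.05, too_hard=0.10):
    """Tally the choices for one item and flag the problems described above.

    rare and too_hard are illustrative cutoffs, not values from the text.
    """
    counts = Counter(responses)
    n = len(responses)
    flags = []
    for alt in alternatives:
        if alt == key:
            continue
        if counts[alt] / n < rare:
            flags.append(f"{alt}: rarely chosen; replace with a more plausible alternative")
        if counts[alt] > counts[key]:
            flags.append(f"{alt}: chosen more often than the key; check the key and the wording")
    if counts[key] / n < too_hard:
        flags.append("almost no one correct; consider dropping the item")
    return {alt: counts[alt] for alt in alternatives}, flags


# Hypothetical responses of 50 students to one item keyed "a"
responses = list("a" * 20 + "b" * 9 + "c" * 10 + "d" * 11)
tallies, flags = analyze_item(responses, key="a")
print(tallies)
for flag in flags:
    print(flag)
```

In this made-up case the incorrect alternatives b, c, and d draw roughly
equal numbers of choices, while e is never chosen and is flagged as a
"give away."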

It is axiomatic that students should receive better grades on an exam- 
ination after the course of instruction than before the course begins. 
Similarly the questions which show the biggest change in number of
students who get the correct answer from before to after are more
particularly concerned with the course content. In large school programs,
such as some of those in the Armed Forces, examinations are sometimes
constructed on the basis of an improvement in item scores from before
to after the course. This logic can be carried too far in that an item might
be very difficult both before and after the course not because of
inappropriateness, but because of shortcomings of the instruction.
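
The before-and-after comparison amounts to ranking items by the change in
the proportion of students answering correctly. A minimal sketch follows;
the item names and proportions are hypothetical.

```python
def item_gains(before, after):
    """Rank items by gain in proportion correct from before to after the course."""
    gains = [(item, after[item] - before[item]) for item in before]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical proportions of students answering each item correctly
before = {"item_1": 0.20, "item_2": 0.55, "item_3": 0.10}
after = {"item_1": 0.85, "item_2": 0.60, "item_3": 0.12}

for item, gain in item_gains(before, after):
    print(item, round(gain, 2))
```

An item such as the hypothetical item_3, which is hard both before and
after, should be examined rather than discarded automatically, for the
reason given above.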


PREPARING TESTS FOR ADMINISTRATION 


After items have been selected from an item pool, either in the construction
of a predictor or an assessment test, some practical decisions
must be made about how the items will be assembled and administered.
It must be decided whether or not to order the items in terms of difficulty,
placing the easier items at the first of the test and the harder ones toward
the end. It is sometimes desirable to order the items in terms of difficulty
to prevent students from becoming discouraged because of difficult
items which might appear near the beginning of the test. The items
can be ordered in terms of difficulty only if an empirical tryout has been
made. Such information is likely to be available for a predictor test but
not for the usual classroom examination. What an instructor thinks will
be an easy item may prove to be very difficult for the class, and vice
versa. If difficulty levels have not been obtained from an empirical tryout,
it is better to randomly order the items throughout the test and to tell
the students about it. If each item has been written on a separate file
card, the randomization can be performed by shuffling the cards thoroughly
and “dealing” them out in the order of their appearance.

Test Instructions. The test instructions bear a considerable burden of
accurately directing subjects to the test content. The failure to make test
instructions sufficiently clear and detailed is one of the most obvious
shortcomings of the inexperienced test constructor. The people who deal
extensively with tests realize that some of the subjects will become confused
by even the most straightforward instructions. It is not uncommon
for subjects to fail to put their names in the designated place. If the 
subjects are told not to write on the test booklet, some booklets will 
have numerous marks on them. If the items are at all complicated, such 
as asking the subjects to pick the two most correct alternatives instead 
of only one, some of the subjects will inevitably become confused. The 
ease with which subjects become confused with unusual test items, or 
marking procedures, often forces the test constructor to abandon an 
apparently promising new approach.

The rule is to write test instructions as though all of the subjects were 
of very low intelligence. Test instructions should be written so that the 
dullest and most confused individual will understand them. To do this, 
the instructions should be made redundant and all points should be 
generally overexplained. For example, in a multiple choice test for which 
only one alternative is to be marked, instructions should be written as 
follows: “Mark one of the alternatives for each question. Do not mark 
more than one alternative for each question. Leave none of the questions 
blank.” If special answer sheets are used with the test, at least one,
and sometimes several, examples should be given.

Time Limits. If there is a time limit for completing the test, and there 
usually is, the test constructor must determine in advance how much 
time should be given. If the attribute being tested is logically concerned 
with speed, studies should be undertaken to determine the time limit 
which will produce the largest standard deviation of test scores. This can 
be done by trying a number of different time limits with separate groups 
of subjects. The time which works the best can then be made the standard 
time limit. 
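
The tryout procedure can be sketched as follows, assuming that a separate
group of subjects has been tested under each candidate time limit. The
limits (in minutes) and the scores are hypothetical.

```python
from statistics import stdev


def best_time_limit(scores_by_limit):
    """Return the time limit whose group shows the largest standard deviation."""
    spreads = {limit: stdev(scores) for limit, scores in scores_by_limit.items()}
    best = max(spreads, key=spreads.get)
    return best, spreads


# Hypothetical scores from three separate groups, keyed by time limit in minutes
tryout = {
    10: [3, 4, 4, 5, 5],
    20: [8, 11, 14, 17, 20],
    30: [18, 19, 20, 20, 20],
}
best, spreads = best_time_limit(tryout)
print("largest spread at", best, "minutes")
```

In this made-up example the shortest limit bunches the scores near the
bottom and the longest bunches them near the top, so the middle limit
separates the subjects best.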

There has been a general tendency to make restrictive time limits on 
tests which are not inherently concerned with speed. Some of the pre- 
dictor tests, such as numerical computation and perceptual speed, are 
defined as "speeded" abilities. However, we often find speeded tests of 
reasoning and vocabulary, where there is little ground for restrictive time 
limits. There are few classroom examinations where restrictive time limits 
are justified. Some classroom topics, such as typing and shorthand, are
naturally concerned with the ability to work quickly. However, there is 
little reason to believe that speed is necessarily involved in the under- 
standing of physics, history, psychology and most other school topics. 
When the test is not intended to measure speed, per se, subjects should
be given a comfortable time limit. A few students will feel that they do 
not have an adequate amount of time even if they work an hour after 
the rest of the students have left. Some time limit is necessary simply as 
a convenience for the instructor, but most of the students should be able 
to complete the test without feeling pressured by the time limit.

Auxiliary Materials. Preparations must be made for any special materials
which are needed for the test. If a particular kind of pencil is required,
an adequate supply must be on hand. If devices such as rulers,
drawing equipment, or other instruments are needed, subjects should be 
informed well in advance of the test. If mathematical problems are part 
of the test, the subjects should either be told to bring “scratch” paper 
or it should be provided at the time of testing. Here again, it pays to 
overexplain the test requirements. If the answer sheet must be marked 
in pencil, some of the students will come only with a pen. Some pencil
points will inevitably break and there must be either extra pencils or a
sharpener handy. Trivial as these points seem, the test constructor must
either provide for them or the test reliability will be lowered.


SUMMARY 


There are two basically different approaches to test construction de- 
pending on whether the instrument is intended to be a predictor or an 
assessment. A predictor test is constructed in terms of statistical strategy, 
the purpose of which is to squeeze the last ounce of predictive efficiency 
out of an item pool. Some of the statistical methods which ideally should 
be applied are too complex and laborious to be used except in special 
cases. Therefore, there is a hierarchy of preferred procedures, depending
on the time and facilities available.

An assessment test is constructed in terms of rational rather than statistical
standards. The purpose is to include the important content in an
examination. Only by a clear statement of the goals of instruction, or in
a nonclassroom situation, the merits of different behaviors, can an assessment
test be meaningfully reviewed, discussed, and eventually changed
for the better. In the construction of both predictor and assessment tests
it is important to write items clearly, standardize the scoring, state the
instructions carefully, and systematize the administration procedure.


CHAPTER 9


The Structure of Human Abilities


The term “general intelligence” has been used with reluctance in previous
pages because of two misleading connotations which it bears. The
first is that the tests which are used to measure “general intelligence” are
not assessments, as the term “intelligence” suggests, but are predictor
tests which have gained their meaning from numerous relationships with
important human variables. The second misleading connotation is that
the word “general” implies that there is only one kind of intelligence
rather than separate dimensions of intelligence.

We usually talk as though human abilities are general. We say, “He is
a ‘smart’ person,” without bothering to differentiate the ways in which
the individual is particularly gifted. Sometimes we do make a distinction,
as in saying, “He is very good in mathematical work, but he is not so
good in understanding written material.”

If intelligence is perfectly general, the person who can do one type of
intellectual problem well can do all other kinds of problems well, and
the person who does poorly on one type of problem will do poorly on
all others. Looking at it in terms of the ability to do school work, if
abilities are perfectly general, the child who learns easily to solve
arithmetic problems will have the same facility in learning spelling,
geography, and other topics. Another possibility is that human abilities are
completely “specific,” that there are no correlations between different
intellectual tasks. Then, if we know that a particular child is very adept
at learning arithmetic, this offers no basis at all for predicting how well
he will do in spelling or geography. Taking the case further, if abilities
were completely unrelated, the child who could add numbers very
quickly might be slower than the average child in performing
multiplications.

The real story lies between these two extremes. Human abilities are
neither completely general nor completely specific. In this chapter, we will
discuss the history of this problem, the mathematical procedures of factor
analysis, and some of the different abilities which underlie human
intellect.

Faculty Psychology. The belief was prevalent in the early nineteenth
century that human behavior is determined by a large number of sepa- 
rate capacities, or faculties, as they were called (see 3, 12). There were 
proposed faculties of attention, memory, reason, will power, esthetic 
appreciation, and many others. It was the early belief that each faculty 
resides in a particular brain location and that “bumps on the head” are 
prognostic of the strength of particular faculties (see 12, p. 56). Although 
the anatomical theories were soon discarded, the belief in a large number 
of specific human faculties lingered throughout the remainder of the 
nineteenth century. 

Binet and “General Intelligence.” Alfred Binet's early efforts to measure 
intelligence were in line with the faculty psychology of the time. He 
worked at first with the same kind of sensory and motor tests that were 
used by Galton and others. Some of these included suggestibility, size 
of the cranium, tactile discrimination, graphology, and even palmistry. 
Like the others who were trying to measure human abilities, he found 
that these simpler functions do not measure intelligence, as we com- 
monly think of it. 

Binet abandoned sensory and motor tests in favor of a “molar” con- 
ception of intelligence. For practical reasons, Binet thought that it would 
not be possible to measure all of the simple skills that underlie intelligent 
behavior; instead, it would be more feasible to study the end products 
of intellectual functioning. Consequently, his tests emphasized the ability 
to make sound judgments. He defined intelligence as “the tendency to 
take and maintain a definite direction; the capacity to make adaptations 
for the purpose of attaining a desired end; and the power of auto-criti- 
cism” (15, p. 45). 

The Binet-Simon scale provided each child with one score, a mental 
age score (mental age scores were introduced in the 1908 revision).
Because each child receives only one score, it must be assumed that 
intelligence is general, or that only one factor is involved in the test items. 
If there are, say, two kinds of intelligence involved in the test, the use 
of only one score is somewhat misleading. Two children could make 
the same score, not because they are alike in respect to their abilities but 
because one child is high in one of the factors and low in the other, and 
vice versa for the other child. The same assumption of the generality of 
human abilities is involved in all of the general intelligence tests which 
have followed from Binet’s early work. 


THE GENERAL FACTOR THEORY 


Spearman's G. At the same time that Binet and Simon were working
in France on the first practical mental test, a psychologist in England,
Charles Spearman, was working from a different standpoint. Spearman
hypothesized a general factor of intelligence, which he called the G
factor. The G factor was thought of at that time as being a kind of
“mental energy.” Unlike Binet, Spearman undertook extensive investigations
to determine whether or not there is only one common factor of
intelligence (see 14). In order to do this, he administered batteries of
tests to school children, and by statistical analysis of the scores, he tried
to show that only one common factor underlies all mental tests. In this
work he introduced the techniques of factor analysis to psychology.
Earlier, Karl Pearson (13) had made some of the fundamental
mathematical developments.

Correlation Matrices. If a number of tests have been given to a group
of people, each of the tests can be correlated with each of the others,
resulting in what is called a correlation matrix. It is with such correlation
matrices that factor analysis is undertaken. A hypothetical matrix showing
the correlations between four tests is given in Table 9-1. To read the

TABLE 9-1. INTERCORRELATIONS OF FOUR SCHOOL SUBJECTS


              Sp.    Ar.    Rd.    Sc.
Spelling      ( )    50     68     --
Arithmetic    50     ( )    49     --
Reading       68     49     ( )    71
Science       --     --     71     ( )
matrix, simply look down a column and across a row to find the intersection
of two tests. By looking down the arithmetic column and across to
the reading row, a correlation of .49 is found between arithmetic and
reading. Note that each correlation appears twice in the matrix. A correlation
of .49 is also found by looking down the reading column and stopping
at the arithmetic row. The diagonal cells of the matrix have been left
blank. These are the correlations of the tests with themselves, which are,
technically, all 1.00, but are allowed to assume different values in the
actual factor-analytic work.
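
The construction of such a matrix can be sketched as follows, using the
ordinary product-moment correlation. The scores of six pupils on three
subjects are hypothetical, and the diagonal is left empty (None) to mirror
the blank diagonal cells described above.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(scores):
    """Correlate every test with every other; diagonal cells are left as None."""
    names = list(scores)
    return {
        a: {b: (None if a == b else pearson(scores[a], scores[b])) for b in names}
        for a in names
    }


# Hypothetical scores of six pupils on three school subjects
scores = {
    "spelling": [12, 15, 9, 20, 17, 11],
    "arithmetic": [30, 28, 25, 35, 33, 24],
    "reading": [40, 44, 31, 52, 47, 38],
}
matrix = correlation_matrix(scores)
for row, cells in matrix.items():
    print(row, {col: (None if r is None else round(r, 2)) for col, r in cells.items()})
```

Each correlation appears twice in the result, once above and once below the
diagonal, just as in the printed matrix.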

The Search for G. Spearman's hypothesis can be more completely
specified by the statement that mental tests have only one common factor
and that all else is due to factors specific to each test. That is, part of the
test variance can be accounted for by one common, or general, factor,
and otherwise, mental tests have nothing in common. Spearman thought
that each test would have some particularity, or some specificity,
connected with its items. The concept of “variance explained” and the
diagrams used to show the overlap of variances (from Chapter 7) can
be used here to demonstrate factorial composition of tests. The
overlapping variances (squared correlations) for one common factor in three
tests can be pictured as in Figure 9-1.
In Figure 9-1 the three tests overlap in only one area, the cross-hatched
section, which is labeled G. Each test then has a portion of its area which 
is specific: it does not overlap with the G area and it does not overlap 
with the other tests outside of the G area. 


FIGURE 9-1. Variance diagram of one common factor (G)
in three tests.


Spearman proved that if there is only one common factor in a set of 
test scores, the columns of coefficients in the correlation matrix will be 
proportional to one another (14). A hypothetical matrix which fulfills 
this requirement is given in Table 9-2. The proportionality of the columns 
in Table 9-2 can be tested by successively dividing the correlations in 
one column by those in any other, for all except the vacant diagonal cells. 
Going down columns A and B in this way, it is seen that the coefficients 
in A are 1.1 times as large as the coefficients in B (.67 over .61, .61 over 
.56, and .55 over .50). The coefficients in any other pair of columns are
proportional also. 
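
The division of one column by another can be carried out for every pair of
columns in Table 9-2, as in the sketch below. The reasoning assumed here is
the usual one-factor model, in which each correlation is the product of the
two tests' loadings on the common factor, so that the ratio of two columns
is the same in every row. The variable and function names are hypothetical.

```python
from itertools import combinations

TESTS = "ABCDE"
UPPER = {  # correlations from Table 9-2, decimal points restored
    ("A", "B"): 0.74, ("A", "C"): 0.67, ("A", "D"): 0.61, ("A", "E"): 0.55,
    ("B", "C"): 0.61, ("B", "D"): 0.56, ("B", "E"): 0.50,
    ("C", "D"): 0.50, ("C", "E"): 0.45,
    ("D", "E"): 0.41,
}


def r(x, y):
    """Look up a correlation regardless of the order of the two tests."""
    return UPPER[(x, y)] if (x, y) in UPPER else UPPER[(y, x)]


def column_ratios(col1, col2):
    """Divide column col1 by column col2, skipping rows that hit a diagonal cell."""
    rows = [t for t in TESTS if t not in (col1, col2)]
    return [r(row, col1) / r(row, col2) for row in rows]


for col1, col2 in combinations(TESTS, 2):
    print(col1, col2, [round(x, 2) for x in column_ratios(col1, col2)])
```

Within each pair the ratios agree to about two decimal places; the small
differences come from the rounding of the tabled correlations.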


TABLE 9-2. A HYPOTHETICAL CORRELATION MATRIX SHOWING ONE COMMON
FACTOR IN FIVE TESTS °


        A      B      C      D      E
A      ( )     74     67     61     55
B      74     ( )     61     56     50
C      67     61     ( )     50     45
D      61     56     50     ( )     41
E      55     50     45     41     ( )


° All decimal points have been removed; consequently, 74 should be understood as
a correlation of .74.


Twenty years or more were spent by Spearman and others in England 
trying to determine whether only one common factor, G, underlies all 
mental tests. The tests used in the first stages mainly concerned simple 
sensory discrimination, but these were soon abandoned for the more com- 
plex functions of the type made popular by Binet. Batteries of tests, 
usually containing no more than a dozen instruments, were administered,
mainly to school children. Correlation matrices were computed and the
proportionality of columns was studied. Even if there were only one
common factor among all tests, the actual correlation coefficients would 
not strictly obey the proportionality criterion because of sampling error. 
Consequently, very laborious studies of the departure from proportionality 
had to be undertaken, some of these lasting for months. 

After much investigation, at first in England and later by American 
psychologists, it became apparent that one common factor is not sufficient 
to account for the intercorrelations among mental tests. In a few restricted 
cases, where the tests were much alike, the proportionality criterion was 
approximated. When a broad range of tests was used, however, it was 
apparent that more than one common factor was needed. Spearman won 
his point, in part, in that one common factor can be shown to explain 
much, but by no means all, of the overlap between mental tests. 


MULTIFACTOR THEORY 


Rather than argue over whether one common factor is sufficient to
explain intellectual behavior, the question soon shifted to that of how
many factors are required to explain the intercorrelations of tests. The
original statistical technique for determining a general factor is of little
use when in fact more than one factor occurs. Procedures were developed
for discovering how many factors are involved in a battery of tests. These
statistical procedures are referred to as “multiple-factor analysis.”
Pearson (13) provided the original solution to the problem. Cyril Burt (5)
in England, T. L. Kelley (11) in America, and later L. L. Thurstone
in America were prominent figures in the development of factor-analytic
techniques.

Cluster Analysis. An approximation to factor analysis can be
made by a simple inspection of the correlation matrix, such as that
in Table 9-3, for example. A cluster is defined as a group of tests whose
members correlate highly among themselves and correlate relatively low
with tests not in the cluster.


TABLE 9-3 °


        A      B      C      D      E      F
A      ( )     01     64     06     72    -13
B      01     ( )    -07     42     08     56
C      64    -07     ( )    -02     55    -19
D      06     42    -02     ( )     11     47
E      72     08     55     11     ( )    -04
F     -13     56    -19     47    -04     ( )


° All decimal points omitted.

One way to begin a cluster analysis is first to locate the highest correlation 
in the matrix. In this case, it is the correlation of .72 between tests 
E and A. These two tests can form the nucleus of the first cluster. Any 
other test that correlates highly with both A and E can also be placed 
in the cluster. Test C correlates .64 with A and .55 with E, so it can be 
classified in the cluster. None of the other three tests correlates highly 
with A, C, and E; therefore, no other tests belong to the first cluster. 

If there are substantial correlations among the tests which do not go 
into the first cluster, this means that one or more additional clusters 
remain. We then look at the correlations among tests B, D, and F. The 
three intercorrelations are .42, .47, and .56. There are then two clusters, 
or in the same sense, two factors, in these six tests. The two clusters can 
be seen more easily if we rearrange the correlations in Table 9-3 to make 
the first three rows and columns contain the tests for the first cluster and 
the last three rows and columns contain the tests for the second cluster. 
This has been done in Table 9-4. Table 9-4 has been sectioned off to 
show more clearly the intercorrelations of the tests within the two clusters 
and the correlations between the tests in the different clusters. Arranged 
in this way, it is easy to see that there are two different kinds of abilities 
involved in the six tests. 

TABLE 9-4. A REARRANGEMENT OF THE COLUMNS AND ROWS OF TABLE 9-3 

        A     E     C  |   D     B     F 
A      ( )    72    64 |   06    01   -13 
E       72   ( )    55 |   11    08   -04 
C       64    55   ( ) |  -02   -07   -19 
-----------------------+------------------ 
D       06    11   -02 |  ( )    42    47 
B       01    08   -07 |   42   ( )    56 
F      -13   -04   -19 |   47    56   ( ) 
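
The inspection procedure just described can be sketched in a few lines of Python. This is only an illustration of the informal method, not a formal clustering algorithm. The function name `inspect_clusters` and the cutoff of .40 for a "high" correlation are assumptions made for the example; the text gives no numerical cutoff. The correlations are those of the six tests, with decimal points restored.

```python
import numpy as np

# Correlations among the six tests (decimal points restored); diagonal set to 1.0.
TESTS = ["A", "B", "C", "D", "E", "F"]
R = np.array([
    [ 1.00,  0.01,  0.64,  0.06,  0.72, -0.13],
    [ 0.01,  1.00, -0.07,  0.42,  0.08,  0.56],
    [ 0.64, -0.07,  1.00, -0.02,  0.55, -0.19],
    [ 0.06,  0.42, -0.02,  1.00,  0.11,  0.47],
    [ 0.72,  0.08,  0.55,  0.11,  1.00, -0.04],
    [-0.13,  0.56, -0.19,  0.47, -0.04,  1.00],
])


def inspect_clusters(R, names, threshold=0.40):
    """Group tests by inspection: start each cluster from the highest remaining
    correlation, then add every test that correlates at least `threshold`
    (an assumed cutoff) with all tests already in the cluster."""
    remaining = list(range(len(names)))
    clusters = []
    while len(remaining) >= 2:
        # Highest correlation among the tests not yet assigned to a cluster.
        pairs = [(R[i, j], i, j) for i in remaining for j in remaining if i < j]
        best, i, j = max(pairs)
        if best < threshold:
            break  # no substantial correlations are left
        cluster = [i, j]
        for k in remaining:
            if k not in cluster and all(R[k, m] >= threshold for m in cluster):
                cluster.append(k)
        clusters.append(sorted(names[m] for m in cluster))
        remaining = [m for m in remaining if m not in cluster]
    return clusters


print(inspect_clusters(R, TESTS))
```

With the assumed cutoff, the first cluster starts from the .72 correlation between A and E and picks up C; the second cluster is formed from B, D, and F.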

If the clusters were generally as clear-cut and as easy to locate as they 
are in the example above, there would be little need for more formal 
procedures of analysis. When the number of tests is as large as that con- 
tained in present-day factor analyses, often more than fifty, it becomes 
very difficult to locate clusters. Also, it will almost never be the case 
that tests will divide themselves neatly into clusters such that each test 
belongs unequivocally to only one cluster. More often it is difficult to 
decide whether a test should be placed in one cluster or another, and 
the tests in one cluster usually have at least moderate correlations with 
the tests in other clusters. 


The Vector Model. A model which is more general than variance diagrams 
is required to demonstrate how multiple-factor analysis works. 
Correlations can be portrayed by lines, or vectors, and by the angles between 
vectors. In Figure 9-2 three correlations have been pictured with 
vectors. In the vector model, each test is represented by one line. The 
closer the lines are to one another, the higher the two tests correlate. In 
the diagram on the left in Figure 9-2, the vectors for tests 1 and 2 are 
at right angles, that is, 90 degrees apart. This always means a zero 
correlation between two tests. In the middle diagram, the vectors are 
less than 90 degrees apart, and consequently, there is a correlation greater 
than zero, .40 in this case. The diagram on the right shows an even 
smaller angle between vectors, portraying a correlation of .64. 

FIGURE 9-2. Three correlation coefficients represented by pairs of vectors. 
(Panels, left to right: r = 0, r = .40, r = .64.) 

The vectors are not all of the same length in the vector model. The 
length of the test vector depends on the over-all amount of correlation 
the test has with the others in the matrix, the "communality." The corre- 
lation between two tests in the model depends both on how large the 
angle is between two vectors and the length of the two vectors. A fuller 
explanation of how this works would take us far beyond the mathematical 
complexity which can be permitted here (see 7, 16, 19). However, the 
meaning of the model can be grasped if it will be kept in mind that long 
vectors and small angles mean high correlations. The shorter the vectors 
and the nearer the angle approaches 90 degrees, the less the correlation. 
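
For readers who want to see the relation in numbers, the following is a minimal sketch. It assumes the usual geometric form of the model, in which the correlation equals the product of the two vector lengths and the cosine of the angle between them; the text itself does not spell out this formula. The function names are hypothetical, and vectors of unit length are assumed when the three correlations of Figure 9-2 are converted to angles.

```python
import math


def angle_for_correlation(r, length_1=1.0, length_2=1.0):
    """Angle in degrees between two test vectors, assuming
    r = length_1 * length_2 * cos(angle)."""
    return math.degrees(math.acos(r / (length_1 * length_2)))


def correlation_for_angle(angle_deg, length_1=1.0, length_2=1.0):
    """Correlation implied by two vector lengths and the angle between them."""
    return length_1 * length_2 * math.cos(math.radians(angle_deg))


# The three correlations pictured in Figure 9-2, with unit-length vectors assumed.
for r in (0.0, 0.40, 0.64):
    print(f"r = {r:.2f}: angle of about {angle_for_correlation(r):.1f} degrees")

# Shorter vectors give a lower correlation at the same angle.
print(correlation_for_angle(60, 1.0, 1.0), correlation_for_angle(60, 0.8, 0.8))
```

Under these assumptions a zero correlation corresponds to a right angle, and shortening both vectors lowers the correlation even when the angle is unchanged.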

All of the intercorrelations in Table 9-4 can be cast in the form of one 
vector model, which is shown in Figure 9-3. The vector model demonstrates 
the two clusters noted earlier. Tests A, C, and E cluster together 
in the space, and similarly, tests B, D, and F are near one another. The 
tests in one cluster are separated in general about ninety degrees from 
the tests in the other cluster, showing the low correlations between the 
tests in the different clusters. 

FIGURE 9-3. Vector model of the correlation coefficients shown in Table 9-4. 

Factor Loadings. What factor analysis does is to place new lines, or 
vectors, in the vector model. The number of new vectors is generally 
equal to the number of major clusters among the tests. Figure 9-4 shows 
how a typical factor analysis would locate two new vectors, FI and FII, in 
the vector model. The factor vectors go out in both directions from the 
origin, having both a plus and minus side. Describing the mathematical 
techniques which are required to place the factor vectors in the vector 
model would again take us into complexities which cannot be treated 
here. (The interested student can learn more about factor-analytic procedures 
from 7, 16, 19.) However, it is not necessary for the reader 
completely to understand the mathematics of factor analysis to comprehend 
what the factor vectors mean once they have been located in the 
vector model. 

FIGURE 9-4. Vector model with factor axes. 

Each of the factor vectors reads like a yardstick, with the numbers on 
the scale showing the amount of correlation with the tests. The two factor 
vectors constitute a coordinate system, much like the axes on a piece of 
graph paper. The correlations of the tests with the factors are referred 
to as loadings. Each test has a loading on each of the two factors. Read- 
ing from the graph, test A has a correlation, or loading, of .90 on Factor 
I and a loading of -.05 on Factor II. The full set of factor loadings, 
called a factor matrix, is given in Table 9-5. 


TABLE 9-5. FACTOR MATRIX FOR SIX TESTS 

Test        Loading 
          FI      FII 
A        .90     -.05 
E        .80      .05 
C        .70     -.15 
D        .10      .60 
B        .05      .70 
F       -.10      .80 
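
One way to see what the factor matrix accomplishes is to check how well the loadings reproduce the original correlations. The sketch below assumes the standard result for uncorrelated factors, which is not stated in the text at this point: the correlation between two tests is approximately the sum of the products of their loadings on each factor. The variable names are chosen for the example only.

```python
import numpy as np

# Rows follow the order of the factor matrix: A, E, C, D, B, F.
TESTS = ["A", "E", "C", "D", "B", "F"]
LOADINGS = np.array([
    [ 0.90, -0.05],
    [ 0.80,  0.05],
    [ 0.70, -0.15],
    [ 0.10,  0.60],
    [ 0.05,  0.70],
    [-0.10,  0.80],
])

# Observed correlations in the same order (the rearranged correlation matrix).
OBSERVED = np.array([
    [ 1.00,  0.72,  0.64,  0.06,  0.01, -0.13],
    [ 0.72,  1.00,  0.55,  0.11,  0.08, -0.04],
    [ 0.64,  0.55,  1.00, -0.02, -0.07, -0.19],
    [ 0.06,  0.11, -0.02,  1.00,  0.42,  0.47],
    [ 0.01,  0.08, -0.07,  0.42,  1.00,  0.56],
    [-0.13, -0.04, -0.19,  0.47,  0.56,  1.00],
])

# Assumed relation for uncorrelated factors: r_ij is the sum over factors
# of (loading of test i) * (loading of test j).
reproduced = LOADINGS @ LOADINGS.T

# The diagonal of the product holds each test's sum of squared loadings.
communalities = np.diag(reproduced)

off_diagonal = ~np.eye(len(TESTS), dtype=bool)
max_gap = np.abs(reproduced - OBSERVED)[off_diagonal].max()
print("largest discrepancy:", round(float(max_gap), 4))
```

For test A and test E, for example, the products are (.90)(.80) and (-.05)(.05), which sum to about .72, the observed correlation. Every other pair is reproduced to within rounding, which is the sense in which two factors "account for" the fifteen correlations.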


The Results of a Factor Analysis. A factor analysis can serve to simplify 
the original battery of test scores and the intercorrelations among 
the tests. The factor matrix in Table 9-5 shows that there are only two 
general kinds of abilities involved in the tests instead of six. At this stage 
we do not know what these two underlying abilities are. Let us say that 
tests A, E, and C are concerned respectively with reading ability, 
comprehension of written material, and vocabulary. We might then interpret 
Factor I as having to do with verbal ability, with the understanding of 
words. Say that tests D, B, and F concern addition, multiplication, and 
mathematical reasoning respectively. The second factor might then be 
interpreted as numerical computation, the ability to [...]. 

After the factors are interpreted and named, a score can be obtained for 
each person on each of the factors. Although a precise determination of 
such scores requires some complicated calculations (see 16, chap. 15), a 
good approximation could be obtained in the present example by simply 
averaging the individual's scores on the three tests which have high loadings 
on a factor. That is, if an individual's scores on tests A, E, and C 
are averaged, this provides a good approximation of the score which he 
has on Factor I. Similarly, scores on Factor II could be obtained by 
averaging the scores on tests D, B, and F. 

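
The averaging approximation can be written out as a short sketch. The scores below are invented for illustration, and the example assumes that all six tests have first been converted to standard scores so that they are on a common scale; the function name is hypothetical.

```python
from statistics import mean

# Tests with high loadings on each factor, taken from the factor matrix.
FACTOR_TESTS = {"I": ("A", "E", "C"), "II": ("D", "B", "F")}


def approximate_factor_scores(standard_scores):
    """Approximate each factor score as the mean of the person's standard
    scores on the three tests that load highly on that factor."""
    return {
        factor: mean(standard_scores[test] for test in tests)
        for factor, tests in FACTOR_TESTS.items()
    }


# Hypothetical standard scores for one person (not data from the text).
person = {"A": 1.2, "E": 0.9, "C": 0.6, "D": -0.3, "B": 0.0, "F": -0.6}
scores = approximate_factor_scores(person)
print(scores)
```

This person would be described as above average on Factor I and somewhat below average on Factor II. The precise methods referred to in the text weight the tests unequally rather than averaging them.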
The Complete Model. The simplified example portrayed in Figures 
9-3 and 9-4 requires only two dimensions, which is another way of saying 
that the vectors lie completely on the flat surface of a page. The angles 
between the vectors might be such that they could not be pictured on a 
flat surface, and at least a three-dimensional model would be required. 
In practical work the test vectors will seldom lie on a flat surface, and 
even three dimensions are often not sufficient for portraying the model. 
Of course, when the model requires more than three dimensions, there 
is no physical way of portraying the vectors. However, this does not 
hinder the mathematical derivation of four, five, and even ten dimensions, 
or factors, to account for the correlations among tests. 

The ultimate aim in performing factor-analytic studies is to reduce the 
nearly infinite number of tests which can be composed to a smaller num- 
ber of basic abilities, or factors. This not only achieves a real parsimony 
in practical work; it helps us to understand the nature of human abilities. 
Factor analysis is most useful in the early stages of research, when not 
much is known about an area of investigation. It is a very helpful way 
of "mapping" the territory before proceeding to a more detailed study 
of particular attributes. 

Correlations among Factors. In Figure 9-4 the two factor vectors are 
placed 90 degrees apart, which signifies that there is a zero correlation 
between the two factors. However, in many factor analyses the factor 
vectors themselves are allowed to correlate, or in other words to assume 
angles different from 90 degrees. If the correlation between any two 
factors is allowed to become too large, then the factors begin to lose 
their value as “different” kinds of abilities. It is not uncommon to find 
factor solutions in which the correlations between particular factors are 
as high as .50. Considering that the factors themselves are correlated, 
this means that factors represent separable but not completely independ- 
ent abilities. 

Factor analysis can be used in many different kinds of problems, but 
our primary concern will be with its application to psychological tests. 
Factor analyses have been made of mechanical abilities, interest tests, 
and personality tests, as well as of the more intellectual functions. We 
will look at the factor-analytic results in the domain of tests of the in- 
tellectual type in this chapter and later refer to factor-analytic findings 
that relate to other kinds of human attributes. 


FACTORS OF ABILITY 


The large amount of factor-analytic work performed during the last 
fifty years permits us to speak with confidence about at least some of the 
factors which underlie tests of the ability type. One cardinal finding is 
that the correlations among human abilities are almost always positive, 
even if small in some cases. Consequently, it would be rare to find one 
human ability for which high performance indicates that the individual 
will do poorly in another type of performance. Because of the tendency 
for all tests of ability to correlate positively with one another, the factors 
which have been found also tend to correlate positively with one another. 
This is a partial confirmation of Spearman's G. Spearman was both 
correct and incorrect: there is a moderate tendency for the individual 
to do as well or as poorly on nearly all ability tests, but there are also 
definite clusterings of tests into separable factors. 

How Many Factors? Factor analysts have differed not so much in the 
kinds of factors they have found, but in how many factors they have 
derived. The number of factors used to explain ability tests is dependent 
on how finely one wants to attack the problem. One could be content 
with only the general factor reflected in the tendency for tests to correlate 
positively with one another. Or, the factor analyst can stop with 
the several most prominent factors, those which explain most of the 
test variance. Most of the British factor analysts have been content 
with exploring only the major factors. At the other extreme, one can 
continuously search for more and more factors of less and less importance. 

An analogy may help in portraying the question of how many factors 
should be derived. In constructing a topographic map, the cartographer 
has a choice as to how finely differences in elevation will be shown in the 
contour lines. If contour lines are used to show differences in elevation 
of 1,000 feet, a mountain might appear as one smooth peak. If contour 
lines were used to show differences in elevation of 200 feet, the increased 
detail might show that there are actually several peaks on the same base. 
Increasing the fineness of the contours to show differences in elevation of 
10 feet would bring out considerably more detail about the terrain. One 
can imagine a logical extension to the use of contours to reflect inches 
of elevation, and then every stone would be visible. The cartographer 
must strike a balance between two opposing standards. The more grossly 
the contours reflect differences in elevation, the easier the terrain is to 
survey and to picture as a map. The more finely the contours reflect 
elevation, the more nearly the map approaches the actual terrain; but, 
after a point, the map becomes so detailed that it is difficult to interpret. 

The factor analyst faces much the same problem as the cartographer: 
he must decide at what point a continued derivation of highly specialized 
factors no longer adds to the knowledge of human abilities. Most of the 
work in factor analysis has concentrated on a search for new factors 
rather than on exploring the usefulness of the factors which are derived. 
Although the major factors have already proved useful as predictor tests, 
it remains to be seen whether or not some of the factors will add to the 
predictiveness of test batteries. When more research is done on the 
meaning and usefulness of factors, we will be in a better position to say 
which factors are important human attributes and which are merely 
artifacts of psychological tests. 

A Factor-analysis Problem. To illustrate the steps that are required in 
a factor analysis, the research reported by the Thurstones in 1941 (20) 
will be described. In an earlier analysis of tests given to college students, 
Thurstone and his colleagues had found nine factors (17). The purpose 
of the second analysis was to see if the same factors would be found in 
children. Sixty tests were constructed for this purpose—a considerable 
amount of test construction. Tests which had produced particular factors 
in previous analyses were included along with new tests, which it was 
hoped would point to undiscovered factors. In order for important factors 
to appear, it was necessary for the tests to range as broadly as possible 
across the ability functions. The tests varied from familiar forms like 
arithmetic and vocabulary to the solving of picture puzzles and the 
counting of dots on paper. 

Each test was separately standardized on a tryout group of children. 
Then test forms were reproduced and arrangements were made for the 
administration of materials. All 60 tests were given to 710 eighth-grade 
students in the Chicago schools. The tests were given, a half-dozen at a 
session, over a period of 10 school days. 

All intercorrelations were computed for a 63 by 63 matrix. This in- 
cluded intercorrelations of the 60 tests plus the three additional variables 
of age, sex, and mental age (obtained from previous testing). This required 
the computation of 1,953 correlation coefficients. Elaborate factor-analysis 
procedures were then applied to the correlation matrix. In all, it 
was a very large research venture, as all such substantial factor-analytic 
studies tend to be. 
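
The figure of 1,953 coefficients follows from counting the distinct pairs that can be formed from 63 variables. A small sketch of the arithmetic (the function name is chosen for the example):

```python
def number_of_correlations(n_variables):
    """Number of distinct pairs among n variables: n(n - 1) / 2."""
    return n_variables * (n_variables - 1) // 2


print(number_of_correlations(63))  # 60 tests plus age, sex, and mental age
print(number_of_correlations(6))   # the six-test example used earlier
```

The count grows roughly with the square of the number of variables, which is why inspection of the matrix becomes impractical for large batteries.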

The Thurstones found seven definable factors in the study of children. 
Previous research in England and America and research since the Thurstones’ 
study have generally supported the seven factors and added a 
number of others. The major factors which have been found are described 
in the following section. 


VERBAL FACTORS 


(Vc) Verbal Comprehension. This factor has been found in many 
different factor analyses (cf. 4, 6). It concerns the ability to understand 
words and written material. Some tests which have high loadings on Vc 
are vocabulary, sentence completion, and reading comprehension. 

Typical items: 
1. Which one of the following words means most nearly the same as 
salutation? 
a. offering 
b. greeting 
c. discussion 
d. appeasement 
2. Which one of the following words is most nearly the opposite of 
languid? 
a. unemotional 
b. sad 
c. energetic 
d. healthy 

(Vf) Word Fluency. This is another verbal factor that has been found 
in numerous studies (see 4, 6). It is concerned with the production of 
words, in contrast to Vc, which is concerned with the understanding 
of words. Any test which requires the subject to produce words rapidly 
has some loading on Vf. Although there is a moderate correlation between 
Vc and Vf (.42 in the study described above; see 20), the 
two abilities are well distinguished. It is possible for an individual to be 
able to produce words rapidly with little understanding of the meaning of 
words. The factor Vc tends to be prominent when there is a range of difficulty 
in words. Factor Vf becomes more important when the words are 
simple and the concern is with how rapidly the words are produced. The 
contrast between Vc and Vf fits in well with the observation that there are 
individuals who have difficulty in expressing their own ideas but are 
quite good at understanding written material. 

Typical items: 

1. Write as many names of foods as you can in the next two minutes. 
2. In each of the following rows write three words that mean almost 
the same as the given word. 
    small 
    helpful 
    kind 
(Note that the words here are simple and that the subject is 
required to produce the new terms.) 


NUMBER FACTORS 


(N) Numerical Computation. One very clear numerical factor has 
been found in different analyses (see 4, 6). It concerns the speed of solv- 
ing arithmetic problems of all kinds—addition, subtraction, multiplication, 
and division problems. The factor is not present in all tests that employ 
numbers but only in those tests where actual computations are required. 
Some of the perceptual factors and reasoning factors to be discussed 
employ numbers but have little of the factor N. 

Typical items: 

[Short arithmetic computation problems set out in column form] 


REASONING FACTORS 


Reasoning is a complex domain of attributes. Some of the factors have 
been found in a number of investigations and others are still in dispute 
(see 6, 8, 9). 

(Rg) General Reasoning. The most common, and most commonly 
found, factor of reasoning is concerned with the ability to invent solutions 
to problems. Arithmetical reasoning problems are most characteristic 
of Rg. 

Typical items: 


1. If two men can build one house in twelve weeks, how many 
houses can twelve men build in two weeks? 

2. How would you get exactly 7 quarts of water from a stream if you 
had one 5-quart container and one 3-quart container? 


(Rd) Deduction. This factor is concerned with the drawing of conclusions, 
as in logical syllogisms. In this type of reasoning there is nothing 
in particular to be discovered or invented, the ability being concerned 
with evaluating the implications of an argument (see 8, 9). 

Typical items: 


1. John is younger than Fred. 
Bill is older than Fred; therefore, Bill is ______ than John. 
2. A student has 10 marbles. No one else in his class has 10 marbles. 
This means that 
a. no one else in the class has marbles. 
b. all of the other students have less than 10 marbles. 
c. some of the students have less than 10 marbles. 
d. some of the students have more than 10 marbles. 
e. only one student has exactly 10 marbles. 

(Re) Eduction of Relationships. A third factor of reasoning has 
appeared often enough to be cited here, but it is not as firmly established 
as Rg and N (see 8, 9). The factor involves the ability to see the relationship 
between two things or ideas and to use the relationship to find other 
things or ideas. The factor is best represented by verbal analogies and 
design analogies. 

Typical item: 


Ship is to sail, as automobile is to 
a. ship 
b. seat 
c. motor 
d. wind 
e. driver 


MEMORY FACTORS 


The domain of memory factors is complex and only partially explored. 
A recent factor analysis by Kelly (10) has helped considerably to clarify 
the area. 

(Mr) Rote Memory. The best-established factor of memory concerns 
the ability to remember simple associations where meaning is of little or 
no importance (see 4, 6, 10). 

Typical item: 

The subject is given a list of names, each of which is paired with a 
number. He is given a minute or so to memorize which number goes 
with which name. Then he is told to turn the page. The next page 
contains the list of names without the numbers. The subject is 
instructed to write the proper numbers next to the names. 

Other items which concern Mr are of the same general kind as the one 
shown above. They would be such as pairing colors with words, initials 
with last names, and photographs with geometrical forms. 

(Mm) Meaningful Memory. There is substantial evidence to indicate 
that there is a factor involved in the retention of meaningful relationships 
which is separable from rote memory (see 10). Factor Mm appears when 
the subject is requested to memorize sentences, meaningfully related 
words, and lines of poetry. 

Typical items: 

1. The subject is asked to read and try to remember a list of sentences 
like the following: 
John repaired the wagon by welding the broken axle. 
The list of sentences is taken away and the subject is then given 
the same sentences with one or more of the words deleted from 
each, like the following: 
John repaired the wagon by welding the broken ______ 
2. The subject is shown a list of meaningfully related pairs of words 
such as the following: 


dog bark 
shoe leather 
hard candy 
small box 


The list is taken away and the subject is presented with only one 
member of each pair as follows: 

dog ______ 
shoe ______ 
hard ______ 
small ______ 


There is evidence for several other memory factors. A number of inves- 
tigators have reported a memory span factor concerning the ability to 
recall perfectly for immediate reproduction a series of unrelated items 
(cf. 6, 10). A typical item would be to read a series of five to a dozen 
numbers and ask the subject to give the numbers back in their exact order. 
There is also some evidence for a visual memory factor in which the abil- 
ity to grasp the relationships within a picture or pattern is important (see 
6, 10). A typical item would consist of showing an individual a landscape 
picture and asking him to remember the details. Then the picture would 
be taken away, and the subject would be presented questions like “How 
many sheep were in the picture?” “What was the boy handing to the 
man?” “Where was the swing located?” The visual memory factor might 
be related to the ability to remember faces, license numbers, and wit- 
nessed events. 


SPATIAL FACTORS 


(So) Spatial Orientation. This factor concerns the ability to detect 
accurately the spatial arrangements of objects in respect to one’s own body 
(see 6, 18). The factor would be necessary in deciphering pictures taken 
from a maneuvering airplane. If the plane is simultaneously turning and 
climbing, the landscape looks very different from the normal view that 
we get. The individual who can accurately detect what maneuver the 
airplane is going through from looking at only a picture of the landscape 
from that vantage point is high in spatial orientation. The factor appears 
most prominently when the spatial problems are presented under 
“speeded” conditions. 

(Sv) Spatial Visualization. This second spatial factor differs in a subtle 
manner from So. It is present when the subject has the ability to imagine, 
or "visualize," how an object would look if its spatial position were 
changed. Although there is good statistical support for both of these fac- 
tors there has been considerable difficulty in understanding the under- 
lying processes (see 6, 9, 18). Spatial orientation seems to require either 
an actual or imagined adjustment of one's own body. In spatial visualiza- 
tion the observer cannot solve the problem by a bodily adjustment; 
instead, he must conceive of how an object would look if its spatial position 
were markedly changed. In contrast to spatial orientation, spatial visuali- 
zation is best tested under relatively “unspeeded” conditions. 

Typical item: 

The subject is shown a folded piece of paper with a number of holes 
punched in it. He is asked to choose from a number of alternatives 


how the paper would look if unfolded. 


PERCEPTUAL FACTORS 


A number of factors have been found which concern the ability to 
detect accurately visual patterns and to see relationships within and 
among patterns. Some of these factors are apparently of only limited 
importance, such as the ability to judge certain types of illusions. Several 
of the more important factors will be described. 

(Ps) Perceptual Speed. This factor is concerned with the rapid recognition 
of perceptual details and particularly with the similarities and differences 
between visual patterns (see 9, 18). 

Typical items: 

1. The subject is shown a geometrical form and asked to choose from 
a number of other forms the one that is the same as the first. 
2. The subject is told to make a check mark by each pair of letter 
groupings if they are identical and to make no mark if they are 
different. 
[Sample pairs of letter groupings] 
(Fifty to several hundred pairs would be used, depending on 
the time allowed.) 

(Pc) Perceptual Closure. This factor concerns the perception of objects 
from limited cues (see 9, 18). The word “closure” means a sudden 
awareness of an obscure object or relationship. Perceptual speed requires 
only the recognition of a perceptual form. Perceptual closure requires the 
“putting together” of a perceptual form when only part of it is presented. 
(See Figure 9-5.) 

There is some evidence for the existence of a flexibility of closure factor 
in problems which require the subject to detect one perceptual pattern 
which is imbedded in a distracting or competing pattern (see 9, 18). This 
factor is found in such items as the hidden-picture games that are printed 
in newspapers and some children's magazines. At first glance, the picture 
looks like a normal landscape, but after careful scrutiny, a number of 
faces can be found hidden in the trees and rocks. In order to see the 
faces, it is necessary to resist or “break down” the perception of the 
object in which the faces are imbedded. 

FIGURE 9-5. Items pertaining to the Perceptual Closure Factor. The configuration in 
each frame is projected briefly on a screen. The subject must identify the hidden 
word, letter, or number. (Adapted from Thurstone, 18, pp. 13, 21.) 


SPEED 


The question of whether a test should be “speeded” or not has come up 
a number of times in previous chapters. Nearly all tests have a time limit 
for practical purposes, if for no other reason. In a vocabulary test, for 
example, there is no intention of hurrying the subject, and the time limit 
is usually liberal enough to satisfy most individuals. Some abilities are 
intimately connected with speed, such as typing and shorthand speed; and 
also some of the factors which have been discussed, such as word fluency 
and perceptual speed, can only be measured by the use of “speeded” tests. 

Considerable interest has been shown over the last fifty years in determining 
whether or not speed and accuracy are the same in psychological 
tests. Although the findings on this topic have not been consistent and the 
answer is not the same with all tasks, the general conclusion is that speed 
goes along with accuracy to an extent, but that these two attributes are far 
from perfectly correlated. A comparison of scores given under “speeded” 
and “nonspeeded” conditions shows that at least one and perhaps a number 
of speed factors can be isolated (see 4, 6). It is not known at this 
time just how speed factors interact with the other ability factors. 


MULTIFACTOR TEST BATTERIES 


From this point on, the book will take a practical slant with the exami- 
nation of typical tests and measurement methods in use in psychology. No 
effort will be made to give an exhaustive list of current tests. There are 
thousands of commercial tests available, and any effort to describe a 
sizable fraction of these would provide very dull reading. Tests come and 
go, and better ones will always be forthcoming as our knowledge of test- 
ing increases. Rather than make this a catalogue of tests, the book is 
intended to impart a logic for measurement methods which can be applied 
to tests in general. A list of test publishers is presented in Appendix 1. The 
individual who uses tests in professional work will want to obtain test 
catalogues and in this way keep up with new tests as they are developed. 

In the sections to follow, several tests will be described to illustrate the 
measurement of each type of human attribute. The usual procedure will 
be first to discuss one of the older tests in each area to provide historical 
perspective; then, one or several of the leading tests in the area will be 
discussed. 

Judging a Multifactor Battery. Differential aptitude tests can be con- 
structed either rationally or from the results of a factor analysis. Preferably 
the battery should come from a factor analysis. Otherwise there is no 
guarantee that the tests will measure different attributes or that the tests 
will cover a wide range of attributes. If the battery is derived from factor 
analysis, each of the tests should have high “factorial validity.” A test has 
high “factorial validity” if it correlates highly (has a high loading) with 
the factor it is intended to measure. 

The statistical necessities of efficient differential prediction as described 
in Chapter 7 apply to all differential aptitude, or multifactor, batteries. 
Reiterating, each of the subtests should have high individual reliability, 
and the subtests should correlate as low as possible with one another. On the practical 
side, differential aptitude batteries should be judged in terms of the 
amount of research which has been done on the instrument, the representativeness 
of the norms, the appropriateness of the battery for different 
groups of persons, and the time and expense of administration. 

The Primary Mental Abilities Test (PMA). The PMA tests were developed 
directly out of the factor-analytic work of Thurstone and his colleagues, 
one phase of which was described above. It was the first 
comprehensive multifactor battery and [...] psychological 
measurement. The first regular edition appeared in [...]. It was 
called the Chicago PMA Tests for Ages 11 to 17. Three tests were used 
to represent each of six factors. The following factors were represented: 
(1) rote memory, (2) verbal comprehension, (3) word fluency, (4) nu- 
merical computation, (5) spatial orientation, and (6) general reasoning. 
Science Research Associates (SRA) later assumed publication of the 
PMA series and issued forms for the 5-to-7, and 7-to-11, as well as for the 
11-to-17-year levels. The batteries have been considerably shortened, with 
only one short test being used to measure each factor. Some of the factors 
are not represented at one or more of the age levels: for example, the rote 
memory factor is dropped from the 11-to-17-year battery. Figure 9-6 
shows typical items which appear on current forms of the PMA. 
Figure 9-6. Sample Items from the SRA Primary Mental Abilities Tests for Ages 11
to 17. (Reproduction by permission of Thelma G. Thurstone and Science Research
Associates.)

Verbal meaning is tested by multiple choice vocabulary items like the following:
Safe: A. secure B. loyal C. passive D. young
The word which means most nearly the same as the first word is to be marked.

Space is tested by items like the following:
[Item figures (a) through (f) not reproduced.]
Every figure that could be obtained by a rotation of the first figure is to be marked.

Reasoning is tested by items like the following:
a b m c d m e f m g h m
The letters form a series based on a rule. The subject is to discover the rule and
mark the next letter in the series.

Number is tested by items like the following:
[Item not reproduced.]

Word fluency is tested by requiring the subject to write as many words beginning
with a certain letter as possible during five minutes.

Evaluation of the PMA. In spite of the very important research which
preceded the PMA, the current tests fall short of the standards of an
efficient multifactor battery. The fact that only one short test is used to
measure each of the factors has two unfortunate consequences. First,
to have high factorial validity, it is usually necessary to use several tests.
By averaging scores, much of the specificity of each test is canceled and
more of the underlying common factor is tapped. Secondly, either aver-
aging the scores on several tests for each factor or making one longer test
for each factor raises the reliability.
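
The gain in reliability from lengthening a test, or from averaging several tests of the same factor, can be sketched with the Spearman-Brown formula. The formula is not stated in this passage and is brought in here as an assumption; it also assumes that the added tests are parallel to the original one. The starting reliability of .70 is a hypothetical value, and the factor of 3 mirrors the three tests per factor used in the original Chicago battery.

```python
def spearman_brown(reliability, k):
    """Estimated reliability of a test lengthened k times.

    Assumes the added material is parallel to the original test
    (the usual Spearman-Brown assumption, not taken from the text).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return k * reliability / (1 + (k - 1) * reliability)


# Hypothetical single short test with a reliability of .70.
single = 0.70
# Composite of three parallel tests measuring the same factor.
composite = spearman_brown(single, 3)
print(round(composite, 3))
```

Under these assumptions the composite of three tests is noticeably more reliable than any one of them, which is the point made about the shortened SRA forms.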

In some forms of the SRA PMA, reliability statistics have been either 
incorrectly computed or inadequately reported. Split-half reliability esti- 
mates have been made without regard to the highly speeded nature of 
some of the tests. In a restudy of the PMA reliability, Anastasi (1, p. 366) 
found that when properly computed some of the reliabilities fell mark- 
edly. For example, the word fluency test reliability dropped from a 
reported .90 to an actual .72, and the spatial relations test from .96 to .75. 
These reliabilities are too low for an adequate multifactor battery. 

Compounded with the insufficient reliability of some of the PMA sub-
tests, there are moderately high correlations among them. For the
seven-to-eleven-year form, intercorrelations were found to range from .41
to .70; and for the five-to-seven-year battery, intercorrelations ranged from
.51 to .73. The combination of relatively low reliabilities on some subtests
and generally high intercorrelations of subtests considerably weakens the
PMA as a multifactor battery.

Although the PMA test manuals make claims for the specific vocational
and educational implications of scores, only a small amount of actual re-
search has been done with the battery. Although norms are given for the
battery, very little information has been obtained about the nature of the
populations tested. No information is given on sex differences even for
the subtests on which significant sex differences are usually found.

The PMA tests have not kept up with the increasing knowledge about 
ability factors. For example, the reasoning test employed in the 11-to-17- 
year battery, the series completion type of test, has since been found to 
be a mixture of factors and is not the best test to measure general reason- 
ing. Also, a more modern multifactor battery should include several kinds 
of reasoning tests in line with more recent factor-analytic findings. Sub- 
sequent research might show that the PMA tests depend excessively on 
speed. Some of the tests, such as word fluency, should be speeded, but it
is doubtful that speed will increase the predictiveness of other tests, such
as reasoning and verbal comprehension.

The Differential Aptitude Tests. The Differential Aptitude Tests
(DAT) were developed by The Psychological Corporation, principally
for the vocational and educational guidance of high school students
(2). Although the tests were intended for use with students in
grades 8 through 12, they have a sufficient range of item difficulty to be
used with most adult groups. The tests were not developed directly out
of factor-analytic work but were composed in such a way as to incorporate
some of the major findings from factor analysis.

The subtests of the DAT are as follows (see Figure 9-7 for sample 
items): 

1. Verbal Reasoning. The items consist of verbal analogies, in which the 
reasoning component rather than the difficulty of words is emphasized. 
The test is more concerned with the reasoning factors than with verbal 
comprehension. 

2. Numerical Ability. The test covers a wide range of numerical compu- 
tations. It should be a good measure of the numerical computation factor, 
as it was previously described. 

3. Abstract Reasoning. This test differs from the verbal reasoning test 
in that all of the problems deal with abstract patterns. 

4. Spatial Relations. The test items concern the individual's ability to 
imagine how objects would look if they were rotated in space and to 
visualize a three-dimensional object from a two-dimensional pattern. 

5. Mechanical Reasoning. The test items consist of pictures which por- 
tray mechanical problems. The subject is asked questions about each 
picture. 

6. Clerical Speed and Accuracy. The test is modeled closely after the 
perceptual speed factor. The subject is required to find identical sets of 
numbers and figures. 

7. Language Usage. This test is more concerned with acquired knowl- 
edge, or achievement, than with a specific aptitude. The two parts of the 
test concern spelling ability and grammatical usage in sentences. 

Figure 9-7. Sample items from the Differential Aptitude Tests. (Reproduced by per-
mission of The Psychological Corporation.)

Verbal Reasoning

. . . . . is to one as second is to . . . . .
1. middle 2. queen 3. rain 4. first
A. two B. fire C. object D. hill

. . . . . is to night as breakfast is to . . . . .
1. flow 2. gentle 3. supper 4. door
A. include B. morning C. enjoy D. corner

Pick out words which will fill the blanks so that the sentences will be true and
sensible. For the first blank, pick out one of the numbered words. For the second
blank, pick out one of the lettered words.

Abstract Reasoning

[Problem figures and answer figures not reproduced.]

Each row consists of four figures that are called Problem Figures and five called
Answer Figures. The four Problem Figures make a series. You are to find which one
of the Answer Figures would be the next, or fifth one, in the series.

Space Relations

[Patterns and figures (a) through (e) not reproduced.]

The test consists of forty patterns which can be folded into figures. For each pat-
tern, five figures are shown. You are to decide which of these figures can be made
from the pattern shown.

Numerical Ability

Add 13 and 12:         A 14   B 25   C 16   D 59   E none of these
Subtract 20 from 30:   A 15   B 26   C 16   D 3    E none of these

Choose the correct answer.

Mechanical Reasoning

[Pictures not reproduced.]

Which man has the heavier load? (If equal, mark C.)
Which weighs more? (If equal, mark C.)

Language Usage Part I: Spelling

man
gurl
catt
dog

Indicate whether or not each word is spelled correctly.

Language Usage Part II: Sentences

Ain't we / going to the / office / next week / at all.
    A           B           C          D          E

The test consists of a series of sentences, each divided into five parts, lettered A, B,
C, D, and E. You are to look at each and decide which of the lettered parts have
errors in grammar, punctuation, or spelling.

Clerical Speed and Accuracy

Test items
V. AB AC AD AE AF
W. aA aB BA Ba Bb
X. A7 7A B7 7B AB
Y. Aa Ba bA BA bB
Z. 3A 3B 33 B3 BB

[Sample of answer sheet not reproduced.]

Each test item consists of five letter and number combinations. You are to look at
the one combination that is underlined and find the same combination on the answer
sheet. (The examples above are all correctly done.)

Evaluation of the DAT. Although the DAT was not derived directly
from factor analysis, the tests represent a collection of reasonably inde-
pendent measures which range broadly over those factors most directly
related to school achievement. The tests were constructed primarily for
school programs, and it is doubtful that they would meet with the same
level of success in other testing situations. The correlations between the
tests vary considerably, going from .06 to as high as .67. The intercorre-
lations run about .50 on the average. Although it is desirable to have lower
correlations, they are not so high as to prevent the battery from function-
ing as a measure of differential aptitudes.

The reliabilities of the tests are generally high, ranging from .85 to .98 for 
all except the mechanical comprehension test. The reliability for men on 
the mechanical comprehension test is sufficiently high, with a mean coeffi- 
cient of .85, but the reliability for women is only .71. 
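
The joint effect of subtest reliability and subtest intercorrelation on differential scores can be sketched with the standard formula for the reliability of a difference between two standardized scores. The formula is assumed here rather than quoted from the text, and the input values are illustrative only: they are chosen to fall within the ranges reported above for the DAT and for the PMA, not taken from any one pair of subtests.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference between two standardized subtest scores.

    Uses the standard formula ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy),
    which is assumed here rather than quoted from the text.
    """
    if r_xy >= 1:
        raise ValueError("r_xy must be less than 1")
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Illustrative DAT-like pair: reliabilities of .90, intercorrelation of .50.
dat_like = difference_reliability(0.90, 0.90, 0.50)
# Illustrative PMA-like pair: reliabilities of .72 and .75, intercorrelation of .55.
pma_like = difference_reliability(0.72, 0.75, 0.55)
print(round(dat_like, 2), round(pma_like, 2))
```

With values of this kind, differences between DAT-like subtests remain fairly dependable, whereas differences between PMA-like subtests do not, which is consistent with the contrasting evaluations of the two batteries.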


TABLE 9-6. MEDIAN CORRELATIONS OF DIFFERENTIAL APTITUDE TEST SCORES
WITH SCHOOL GRADES*

(Adapted from G. K. Bennett, et al., 2)

                                                        Soc. stud.,
Test                                 English  Math.  Science  history  Languages  Typing  Shorthand
Verbal Reasoning (VR)                   50      39     --       48        --        18       44
Numerical Ability (NA)                  48      50     --       --        58        --       27
Abstract Reasoning (AR)                 36      35     --       --        --        --       --
Space Relations (SR)                    27      32     36       25        19        16       16
Mechanical Reasoning (MR)               24      22     38       24        17        14       14
Clerical Speed and Accuracy (CSA)       --      --     --       --        --        --       --
Spelling (Spell.)                       --      --     --       39        --        --       --
Sentences (Sent.)                       52      36     48       46        40        30       49

* Decimal points omitted. A dash marks an entry that is not legible in the source.
A particular point in favor of the DAT is the large amount of research
that went into the standardization and validation of the instrument. The 
norms are based on the testing of 47,000 students in grades 8 to 12 scat- 
tered widely over the United States. Because sizable sex differences were 
found on some of the tests, separate norms are given for boys and girls. 

Thousands of correlations between DAT scores and various criteria
have been computed. Some correlations between test scores and class
grades are shown in Table 9-6. A follow-up study was undertaken of
students two years after completing high school. The results are shown in
Table 9-7. In Table 9-7, scores for the total group are broken down in
terms of subsequent occupation or college specialty. Some of the groups
in Table 9-7 are represented by only several dozen individuals, and con-
sequently, the results should be considered tentative. However, there is
an apparent tendency for occupations to be represented not only by differ-
ences in performance level but by “shape” characteristics as well. More
research of this kind will provide the DAT and other similar test batteries
with a firmer basis for vocational counseling. The DAT is a model of
careful test design, practicality, thoroughness of research, and frankness
of reporting.


TABLE 9-7. PERCENTILE EQUIVALENTS OF AVERAGE SCORES ON THE DAT
FOR MEN IN VARIOUS EDUCATIONAL AND OCCUPATIONAL GROUPS

(Adapted from G. K. Bennett, et al., 2)

                                                            Percentiles
Group                                          No.   VR  NA  AR  SR  MR  CSA  Spell.  Sent.

Degree-seeking students:
  Premedical                                    24   88  86  81  72  77   77    90     90
  Science (biology, chemistry, and math.)       25   81  85  60  67  68   74    81     72
  Engineering (includes architectural)          70   80  86  80  81  82   67    68     74
  Liberal arts (includes prelaw)                68   79  75  78  61  64   75    78     81
  Business administration                       64   72  73  61  63  60   71    67     68
  Education (includes physical education)       25   68  66  58  57  48   68    67     66
  Various: predental, agricultural, etc.        30   64  67  74  60  73   59    55     64

Non-degree students in two-year schools:
  Business, technical, fine arts, etc.          43   63  62  71  72  68   60    54     61

Employed:
  Salesmen                                      23   56  53  53  52  57   44    50     49
  Clerks: general office work                   55   45  42  52  44   4   41    52     53
  Mechanical, electrical, and building trades   66   34  41  46  49  56   44    28     29
  Various skilled: butcher, baker, etc.         26   47  87  43  41  49   51    52     45
  Various unskilled: truck driver, laborer,
    etc.                                         8   35  30  36  42  49   37    35     36
  Military service                             129   46  42  46  51  50   49    46     41

Unclassified:
  No consistent work or school record           58   53  48  51  54  54   46    47     52

Other Differential Aptitude Batteries. A number of other batteries are 
available. Some of these will be listed and briefly described. 

General Aptitude Test Battery (GATB). This is a very broad battery of 
15 tests constructed by the U.S. Employment Service. More attention 
would be given to the battery here, but unfortunately the tests are avail- 
able only to state employment services. The tests were developed directly 
from factor-analytic studies, and, in most cases, scores from more than one 
test are used for each factor. The following attributes are measured: 
(G) called “general intelligence,” (V) verbal aptitude, (N) numerical 
aptitude, (S) spatial aptitude, (P) form perception, (Q) clerical percep-
tion, (A) aiming, (T) motor speed, (F) finger dexterity, and (M) manual
dexterity.

Flanagan Aptitude Classification Tests (FACT). The battery consists
of 14 relatively independent tests ranged broadly across the ability func-
tions. The battery is published by Science Research Associates. Weight-
ings are given for the prediction of 30 occupational groupings. Almost no
evidence is presented to support the claims of validity.

Guilford-Zimmerman Aptitude Survey. The tests in this series are an
important effort to represent a large share of the known factors of human
abilities. The tests are constructed on the basis of factor analyses under-
taken in the Air Force. Each test is intended to have high factorial validity
for one of the known factors. The present battery includes tests for the
following factors: verbal comprehension, general reasoning, numerical
computation, perceptual speed, spatial orientation, spatial visualization,
and mechanical knowledge (not an aptitude factor). Tests are being
prepared to represent more factors.

An insufficient amount of research has been done with the battery to
know how well it will work. The evidence so far suggests, as would be
expected, that using only one test does not provide an adequate measure
of each factor. The battery is still in an experimental stage.

The Status of Multifactor Batteries. Whereas there has been a consid-
erable amount of research done to find new factors of human ability,
relatively little research has been done to develop test batteries for prac-
tical use. This is due in part to the tendency for measurement specialists
to be more interested in the theoretical aspects of the problem than in the
more mundane practical aspects. Another reason is that test batteries can
often be sold for considerable profit, and it is tempting to place tests on
the market before their actual worth has been determined.
CHAPTER 10


General Intelligence Tests


The text discussion is far enough along to permit a more direct explana-
tion of what “general intelligence” tests measure. In Chapter 9 it was
shown that the domain of intellectual abilities is not entirely “general,”
that a number of separable factors can be shown to underlie tests of
human ability. The tests which will be discussed in this chapter are gen-
eral instruments in that the individual receives only one score. The one
score is an over-all, or general, measure of his ability to deal with the
different kinds of material in the test.

What the General Intelligence Tests Measure. Having gone to some
lengths to show that the concept of “general intelligence” is somewhat
misleading, at this point let us consider the arguments supporting the
practical use of general measures. The general intelligence tests take
advantage of the fact that the separable factors of intellect are themselves
correlated. In addition, not all of the factors are intimately related to
intelligence as it is popularly conceived. Factors such as verbal compre-
hension and general reasoning are more commonly thought of as consti-
tuting intelligent behavior than are factors like spatial orientation and per-
ceptual speed. Therefore, the general intelligence tests have tended to
incorporate material largely from the factors which more nearly fit the
popular conception of intelligence. The use of one score represents pri-
marily the adding together of only several factors rather than the many
that can be found in diverse tests.

The content of most general intelligence tests is predominantly related
to the verbal comprehension factor, the numerical computation factor, and
various mixtures of the reasoning factors. The remaining material in
these tests tends to be scattered thinly over the memory, spatial, and
perceptual factors. Most of the intelligence tests correlate highly with
one another, coefficients as high as .80 being commonly found between
some of them. However, the correlations are in most cases far enough
from perfect to show that the tests measure partially different things, or
factors. Factor-analytic studies have been made of some of the major
tests, and these findings will be discussed in the pages ahead.

Some psychologists argue for the use of general intelligence tests because
they have not given up the concept of general intelligence. They are
suspicious of factor-analytic findings and maintain a holistic view of
intelligence. Their viewpoint is that the individual's effective intelligence
comes from a pattern or combination of factors that is distinct from
scores on the separate factors. There is some support for this viewpoint,
particularly in regard to school work. It is common
for individuals to solve the same problem successfully yet use different
approaches. For example, one individual may rely heavily on his spatial
ability to help solve geometrical problems. Another person may solve the
same problems more abstractly by relying on his reasoning ability.

Another argument for the holistic conception of intelligence is that the
grades in different school courses tend to correlate highly with one
another. This is perhaps partially the fault of teachers, who tend to form
over-all impressions of students; but it also indicates that to an extent
the underlying factors are somewhat interchangeable in real-life situa-
tions. Regardless of the theoretical merit of the holistic viewpoint, the
kinds of tests which have traditionally been used as measures of general
intelligence are composed of a number of different factors.

Another argument for the use of general intelligence tests arises from
the defects in many multifactor, or differential aptitude, batteries. It was
previously said that an efficient differential aptitude battery must have
highly reliable subtests which correlate low or only moderately with one
another. Otherwise, the battery will manifest only statistically insignifi-
cant differential scores, and it would be as well to add the tests together
to form one dependable measure. This is a telling argument against some
of the less carefully constructed multifactor batteries. However, as more
research is done on the construction of multifactor batteries, the general
intelligence tests will probably assume less importance as measurement
devices.


INDIVIDUAL VERSUS GROUP TESTS


Tests can be given to each individual separately, or a group of indi-
viduals can be tested at one time. Although this is a consideration for
tests of all kinds, it is a particularly important issue in the use of general
intelligence tests. The multifactor batteries are almost always group tests.
There is considerable competition between individual and group tests of
general intelligence. Numerous articles have appeared in the psycho-
logical literature arguing whether group or individual tests should be used
with particular age groups.

Individual Tests. The first practical general intelligence test, the Binet-
Simon scale (9), was administered individually. This was necessary
because the subjects were young children. The individual test requires a
highly experienced examiner. The examiner is in essence a part of the 
standardized testing procedure, and he must standardize his own treat- 
ment of the child to conform to established methods. It is not necessarily 
the examiner's job to get the “best” score from the subject but to obtain a 
score which is the same as would be found by other expert examiners. If 
one examiner is more or less friendly and encouraging than another, this 
will inevitably lower the test reliability. 

Many of the items on individual tests cannot be scored unambiguously 
as right or wrong. Instead there may be a number of acceptable responses, 
and different scores are often required to indicate the degree of correct- 
ness. The better-established individual tests go to considerable effort to 
specify just what will be considered a correct response and how much
credit should be given for a response. The examiner must follow the estab-
lished scoring procedures meticulously and not permit his subjective
judgment of a child to influence the test results. Any idiosyncrasies in
scoring will make the test less reliable.

Group Tests. As was discussed previously, the first practical group tests 
of general intelligence were developed for the Armed Forces during the 
First World War. Here again necessity was the mother of invention: the 
number of men to be tested required a quick and economical measure- 
ment device, and the individual test was unsuited for that purpose. 

Most group tests are sufficiently self-explanatory so that the test exam- 
iner need have little or no specialized knowledge of testing procedures. 
The test forms are simply passed out to subjects; either they are allowed 
to work at their chosen rate or the examiner directs the subjects when 
to start and stop.

Choosing between Individual and Group Tests. Although the following
rules are not precisely correct for all group and individual tests, the rules 
are sufficiently general to offer a reasonable basis for choosing between 
the two kinds of tests in most situations. 

1. Individual tests are required with young children. Starting at the
earliest age, there are no group tests that can be used with infants. Each
infant must be examined separately, and any effort at standardization of
procedures is difficult. With preschool children and children in the early
grades of grammar school, it is usually better to use individual tests.
Young children either cannot read at all or they lack the reading ability
to take the self-explanatory group forms. As an additional factor, young
children are highly distractible, and it is all that the expert examiner can
do to keep one child working at the test materials. Young children are
often not motivated to do well on tests, and it is only through the exam-
iner’s careful, but standardized, encouragement that a meaningful meas-
ure can be obtained.

2. Group tests are generally more desirable in testing “normal” adults.
The well-standardized group tests prove to be as good predictors as the
individual tests when working with most teenage and adult groups.
Although the evidence is clear that individual tests are better predictors
for young children and that group tests do as well with adolescents or
older, it is not certain which kind of test is generally more valid with the
in-between age group.

Adolescents and older persons are usually motivated and attentive
enough to manifest a meaningful score on a group test. Also, they are
less embarrassed by the group testing situation than they would be in the
face-to-face individual testing situation.

Because group tests tend to be as valid as individual tests when
administered to adults, group tests are much to be preferred in practical
work. Group tests are much less expensive and time consuming. It takes
no more time to administer and score 100 group tests than it does to
administer and score 1 individual test.

3. Individual tests are often useful in clinical settings. The clinical psy-
chologist can often learn considerably more from the individual test than
the subject’s score would indicate. The child who appears dull in the
classroom may be only hard of hearing. Another child may do poorly in
school work because he wants to do poorly, giving wrong answers when
he knows what is correct. The older person may appear to be “demented”
because he is discouraged. These things probably would
not be found in group testing situations, but an experienced clinician can
often use the individual testing situation to diagnose why an individual
performs poorly.

4. Group tests are usually easier to construct than individual tests. If
one plans to construct a predictor test, an individual test should not be
undertaken unless measurement specialists are available
and it is planned to spend considerable time and money in test construc-
tion. The difficulties in constructing test materials, standardizing the test,
and particularly the setting of instructions for administration and scoring,
make the development of an individual test a time-consuming and difficult
job. Training test examiners for the individual tests is an additional
expense.




VERBAL VERSUS PERFORMANCE TESTS 


Verbal requirements 


1. Understand spoken language 
2. Understand written language
3. Speak language 
4. Write language 
5. Verbal comprehension factor 
There are many different combinations of the above verbal require- 
ments in particular tests. A test may require the first four items in the list, 
but deal with language at so simple a level that very little ability in verbal 
comprehension is required to obtain a high score. It is possible to make 
up a test in which almost none of the five aspects is required. This can be 
done by giving the test instructions in pantomime and using test materials 
that require neither written nor spoken responses. It is also possible to 
compose a test in which none of the first four aspects is present and the 
fifth aspect, verbal comprehension, is a cardinal requirement. If the test 
requires the child to manipulate abstract symbols or to deal with pictures, 
this will tap, in part, the verbal comprehension factor. Each test should 
be examined in terms of its combination of verbal requirements rather 
than simply classified as "verbal" or “nonverbal.” 
Another distinction can be made between tests in terms of the way in 
which responses are made: 
Nature of response 
1. Symbolic response. The subject indicates the correct answer either
   through the use of language or by marking one of a number of
   choices. The symbolic response might be made in respect to
   objects rather than printed materials, although this is usually not
   done.

2. Manipulative response (performance). The subject is required to
   handle objects in such a way as to complete a specified product.
   The product may be anything from a completely finished piece of
   machinery to the arrangement of a set of blocks.

Some items are not clearly differentiated in terms of the two kinds of
responses. For example, in maze tracing, the child is required to coordi-
nate the pencil and move to the goal; the response is as symbolic as it is
manipulative.

It has been the custom to call instruments “performance” tests if they
deemphasize language requirements, employ three-dimensional materials,
and require manipulative responses. Because of these components, the
“performance” tests usually measure motor coordination, speed, and per-
ceptual and spatial factors. Instruments are usually referred to as “verbal”
tests if they are printed forms, emphasize verbal comprehension, and
require symbolic responses. Because of the ease with which certain kinds
of test materials can be placed on printed forms, the “verbal” tests tend
to measure the verbal comprehension, numerical computation, and the
reasoning factors. There is no clear-cut separation between the factors
found in “verbal” and “performance” tests, but there is a tendency for
different factors to arise in the two kinds of materials.


THE BINET TEST AND ITS FOLLOWERS 


The 1905 Binet-Simon test (see 33) consisted of the following 30 items:

 1. Follow a lighted match with head and eyes.
 2. Grasp a cube placed on the palm.
 3. Grasp a cube held in line of vision.
 4. Make a choice between pieces of wood and chocolate.
 5. Unwrap chocolate from paper.
 6. Execute simple orders.
 7. Touch head, nose, ear, cap, key, and string.
 8. Point to objects which experimenter names in picture.
 9. Name objects pointed out in a picture.
10. Judge which of two lines is the longer.
11. Repeat immediately three digits read by examiner.
12. Judge which of two weights is heavier.
13. Solve problems that embody novel, ambiguous, or contradictory
    solutions.
14. Define house, horse, fork, and mamma.
15. Repeat sentences after a single hearing.
16. Describe the difference between paper and cardboard; fly and butter-
    fly; wood and glass.
17. Name from memory objects shown in pictures.
18. Draw designs from memory.
19. Repeat immediately a longer series of digits.
20. Describe the similarity between blood and wild poppies; fly, butter-
    fly, ant, and flea; newspaper, label, and picture.
21. Judge rapidly which of two lines is the longer.
22. Rank order weights of 3, 6, 9, 12, and 15 grams.
23. Tell which weight the examiner removes from No. 22.
24. Make rhymes.
25. Complete sentences from which words have been deleted.
26. Construct a sentence which contains the words Paris, gutter, and
    fortune.
27. Describe the best action to take in 25 real-life situations.
28. Tell what time it would be if the hour hand and minute hand were
    interchanged at 3:57 and at 5:40.
29. Make a drawing of how the holes in a folded piece of paper would
    look if the paper were unfolded.
30. Describe what is different between liking and respecting, and
    between being sad and being bored.

The list of questions was tried on about fifty children and provisional
norms were established (see 33). It was found that the first five items
could be passed by idiots and normal two-year-olds. Most three-year-olds 
could go no further than about the ninth item. Most five-year-olds could 
go no further than about the fourteenth item. 

After considerable use of the 1905 scale, a revision was published in 
1908 (see 33). Items in the earlier scale which did not seem to work well 
were eliminated and new ones were added. The outstanding feature of 
the 1908 revision was the grouping of items by age. A group of items was 
chosen to represent each age level so that about half the children at the 
age level could answer the items correctly. The level at which the child 
missed not more than one of the items was called his basal age. To the 
basal age was added one year for each five items he answered correctly 
at higher age levels. The basal age plus the years credit for items an- 
swered at higher levels constituted the child’s mental age. The mental age 
arrived at in this way was but a crude approximation of the child’s
intellectual standing. The number of tests at different age levels varied from
three to eight, and this was not taken into account when figuring the 
mental age. No credit was given for fractional “years” on the scale, the 
mental ages being restricted to whole numbers. As a more general criti- 
cism, the selection of items and the determination of age levels were based 
on a rather small amount of standardization research.
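
The 1908 scoring rule described above can be sketched in a few lines of code. The function below is only an illustration: the name `mental_age_1908` and its arguments are hypothetical, and it assumes the rule exactly as stated in the text, namely one whole year of credit for each five items passed above the basal age and no credit for fractional years.

```python
def mental_age_1908(basal_age: int, items_passed_above_basal: int) -> int:
    """Mental age under the 1908 rule as the text describes it (illustrative)."""
    if basal_age < 0 or items_passed_above_basal < 0:
        raise ValueError("ages and item counts must be non-negative")
    # One full year per five items; leftover items earn nothing,
    # which is why mental ages were restricted to whole numbers.
    extra_years = items_passed_above_basal // 5
    return basal_age + extra_years


# A child with a basal age of 6 who passes 7 items at higher levels
# receives one extra year; the remaining 2 items are ignored.
print(mental_age_1908(basal_age=6, items_passed_above_basal=7))  # 7
```

The integer division makes the crudeness of the procedure visible: a child who passes nine items above the basal age receives no more credit than one who passes five.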

American Adaptations of the Binet-Simon Scales. During the first dec- 
ade of this century the Binet-Simon scales became very popular in 
America. Instruments of this kind were needed in the care of the feeble- 
minded, in educational research, and in the understanding of juvenile 
delinquency. Goddard (19) made an early adaptation of the scales,
principally translating them into English and making minor revisions. Later
Kuhlmann (26) revised the scales and made an effort to extend the test
downward to the age of three months.

The Terman Revision of the Binet-Simon Scales. The first extensive 
revision of the Binet-Simon scales was undertaken by Terman and his 
colleagues at Stanford University. The new scale, published in 1916, was 
referred to as the Revised Stanford-Binet (46). The revision was so exten- 
sive as to constitute essentially a new test. Over one-third of the items in 
the revision were new and a number of items remaining from the original 
form were altered and moved to different age levels. 

For the first time, considerable empirical work went into the standardi- 
zation of the Binet type scales. The test was standardized on approxi- 
mately 1,000 children and 400 adults. An effort was made to select a 
sample of persons that would approximately represent the population of
the United States. Although the standardization sample left much to be 
desired by modern standards, the effort to obtain a representative group 
of people set a milestone in psychological measurement.

In the 1916 revision, careful attention was given to the
instructions for test administration and scoring. The IQ type of score
was employed for the first time in the Binet series of tests. Because of the
careful work that went into the revision, the test became the standard
clinical instrument for measuring intelligence. It was probably used more
frequently than any other test during the twenty years after it was
published.

The 1937 Revision. The second revision of the Stanford-Binet (47) was
a ten-year undertaking. Instead of developing only one form, two
equivalent forms, Form L and Form M, were produced. They are so different
from the 30 items with which Binet began the series that it is only out of
courtesy to Binet that his name is linked to the 1937 tests.

Each of the new tests was longer than the 1916 test; different kinds of
test items were used, and an even more strenuous effort at standardization
was made. A large number of items was tried out on groups of children
and adults. In all, about 1,500 persons were used in the empirical tryout
of items. The principal criterion for selecting items was that they each
have a "steep age gradient." That is, an item is desirable for the six-year
level if few of the five-year-olds get the item correct, most of the
seven-year-olds get it correct, and about half of the six-year-olds get the item
correct. In order to find sets of items for the different age levels, a
considerable amount of testing and statistical analysis was undertaken.

After the selection of items and the provisional construction of Form M
and Form L, the forms were administered to a carefully selected sample
of the United States population. The sample included 3,184 white,
native-born persons from the ages of 1½ to 18 years. An equal number of boys
and girls was tested at each age level. The sample was drawn from 17
communities in 11 widely separated states. An effort was made to
distribute the subjects in each age level equitably across the socioeconomic
levels. Although the sample was not entirely representative of the United
States population, there has never been a more strenuous effort to create
a well-standardized test.

The results obtained from the large sampling of responses to Form L
and Form M were again item analyzed. The items in each form were
rearranged in such a manner that the mean IQ at each age level is very
close to 100 and the mean IQ for boys and girls is nearly identical. An
effort was also made to obtain equal standard deviations for the IQs at
each age level. However, the researchers were not entirely successful in
equating standard deviations.

The Stanford-Binet test is still used as widely or more widely than any
other clinical instrument. There are hundreds of articles in the
psychological literature on the use of the test, on developments in testing
procedure, and on reliability and validity.
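
The "steep age gradient" criterion used in selecting items for the 1937 revision can be made concrete with a small sketch. The function name, the dictionary layout, and the numerical cutoffs below are all assumptions chosen for illustration; the text gives only the qualitative rule (few younger children pass, about half of the children at the target age pass, most older children pass) and does not report the thresholds actually used.

```python
def has_steep_age_gradient(pass_rates, level,
                           few=0.25, most=0.75, about_half=(0.40, 0.60)):
    """Check one item against the qualitative rule described in the text.

    pass_rates maps age in years to the proportion of children at that age
    who pass the item. The cutoffs `few`, `most`, and `about_half` are
    hypothetical values, not figures from the standardization research.
    """
    younger = pass_rates[level - 1]
    at_level = pass_rates[level]
    older = pass_rates[level + 1]
    low, high = about_half
    return younger <= few and low <= at_level <= high and older >= most


# Hypothetical tryout results for two candidate items at the six-year level.
steep_item = {5: 0.15, 6: 0.50, 7: 0.85}
flat_item = {5: 0.40, 6: 0.50, 7: 0.60}

print(has_steep_age_gradient(steep_item, level=6))  # True
print(has_steep_age_gradient(flat_item, level=6))   # False
```

Both hypothetical items are passed by half of the six-year-olds, but only the first separates five-year-olds from seven-year-olds sharply enough to be useful at the six-year level.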
Administration of the Stanford-Binet. The test materials (see Figure
10-1) include a box of performance items (beads, toys, pictures, etc.),
test blanks on which the child’s responses are recorded, and a manual of
instructions (47). As was mentioned previously, the administration of an
individual test of this kind requires a highly trained examiner, and no
faith should be placed in the scores obtained by an amateur tester. The
test can usually be given in a period of fifty to seventy-five minutes.

FIGURE 10-1. Some of the test materials used with the Stanford-Binet.
(Courtesy of Houghton Mifflin Company.)

No individual is administered all of the items. Instead, the individual is
started slightly lower in the age scale than he is expected to reach. For
example, a seven-year-old might be started off on the five-year-level
questions. [...] as far as he can go. The highest level at which he gets all
of the items correct is called the basal age. The level at which the subject
gets none of the items correct is called the maximal age. The subject’s
mental age equals his basal age plus specific extra credits for items passed
above the basal age. The IQ is then obtained by dividing mental age by
chronological age and multiplying the result by 100 (see Table 10-[...]).

TABLE 10-[...]. COMPUTATION OF IQ FOR A CHILD OF SEVEN YEARS AND SIX
MONTHS (six items at each of the age levels)

Test-year level   Number of items passed   Months’ credit per item   Year credit
VII               6
VIII              6 (basal age)                                      8
X                 2                        2                         1/3
XI                1                        2                         1/6
XII               0 (maximal age)

Total: 9 years

Mental age = 9.0    Chronological age = 7.5    IQ = (9.0/7.5) 100 = 120
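
The arithmetic of the ratio IQ can be checked with a short sketch. The function names are hypothetical. The two months of credit per item follows from having six items at each yearly level; the assumption that six items in all were passed above the basal age is made here only because it reproduces the mental age of 9.0 shown in the table.

```python
def mental_age_years(basal_age_years, items_passed_above_basal, months_per_item=2):
    """Basal age plus month credits for items passed above it (illustrative)."""
    return basal_age_years + items_passed_above_basal * months_per_item / 12


def ratio_iq(mental_age, chronological_age):
    """IQ as mental age divided by chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100 * mental_age / chronological_age


# Child of seven years and six months with a basal age of eight years.
# Six items passed above the basal age is an assumed total (see lead-in).
ma = mental_age_years(basal_age_years=8, items_passed_above_basal=6)
print(ma)                 # 9.0
print(ratio_iq(ma, 7.5))  # 120.0
```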
The Content of the Stanford-Binet. The present test still bears
the earmarks of the original Binet emphasis on cultural artifacts: [...]
objects, and knowledge of the cultural environment. [...]
words and their usage plays a predominant part in the scale, particularly
at the upper levels. Although a number of items of the performance kind
are used at the earlier ages (see Figure 10-1), these are also primarily
concerned with the child's recognition and use of common everyday
objects. Memory items are found throughout the scale, particularly in the
immediate recall of digit series. A few items concerning perceptual and
spatial abilities are found at different age levels. Many of the items
concern one or more kinds of reasoning ability. To illustrate the nature of the
scale, the items at four different age levels are described as follows:
Two-year level
1. Three-hole form board. The child is shown a form board containing
a cut-out square, circle, and triangle. The pieces are removed
and the child is asked to put them in their places. The child
receives credit if all three objects are put in place.
2. Identifying objects by name. The child is shown six toy objects
(button, spoon, etc.) and is asked to identify them. Credit is
received for correctly naming four of the six objects.
3. Identifying parts of body. The child is shown a large paper
doll and is asked to point to parts of the body (“Show me the
dolly’s hair,” etc.). Credit is received for correctly pointing out
three of the parts.
4. Block building. A box of blocks is placed before the child. The
examiner builds a tower of four blocks and asks the child to do the
same. Credit is received if the child makes a tower of at least four
blocks.
5. [...] cards containing pictures [...] is asked, “What is [...]”
[...] at least two of the objects.
6. Word combinations. The examiner notes the spontaneous speech
of the child during the test. Credit is given if the child uses at
least a two-word combination such as “see kitty.”

Six-year level
1. Vocabulary. The child is asked the meaning of words from a 
graded list of 45 terms. Credit is given for five correct definitions. 
(The same list of words is used throughout all higher age levels.) 
2. Copying a bead chain from memory. The examiner makes a chain 
of seven beads using alternately a square and then a round bead. 
The child is asked to make a bead chain like the examiner's. Credit 
is given if all beads are placed correctly. 
3. Mutilated pictures. The child is shown five pictures in which an
object has a missing part, e.g., a wagon with only three wheels,
and asked to say what is missing in each. Credit is given for
getting as many as four of the problems correct.

4. Number concepts. Twelve blocks are put in front of the child. He 
is asked to give different numbers of the blocks to the examiner. 
Credit is given for selecting three correct numbers of blocks out 
of four trials. 

5. Pictorial likenesses and differences. The child is shown six cards 
with a number of figures on each. On each card one figure is 
different in some way from the others. The child is asked to point 
to the figure which is different. Credit is given for five correct 
responses out of six.

6. Maze tracing. The child is shown three designs each of which 
shows two ways for a person to get home. One route is longer 
than the other. The child is asked to trace the shorter route. Credit 
is given for two correct responses out of three. 

Ten-year level

1. Vocabulary. The child is asked to define words from the standard
list. Credit is given for 11 or more correct definitions.

2. Picture absurdities, II. A picture is shown in which an individual
is acting in an illogical manner. The child is asked, "What is
foolish about that picture?" Credit is given if the child can
explain what is illogical.

3. Reading and report. The child is asked to read a selection of
material. He is then asked to recall as much as possible of what
he read. Credit is obtained if there are not more than two errors
in reading and at least ten events are recalled.

4. Finding reasons, I. The child is asked to explain why two social
rules are necessary, e.g., "Give two reasons why children should
not be too noisy in school." Credit is given for supplying two
reasons.

5. Word naming. The child is asked to name as many words as he
can in two minutes. Credit is given for 28 words or more.

6. Repeating six digits. Six digits such as 4, 8, 2, 1, 6, 3 are read at
one-second intervals. The child is asked to repeat the digits in
their exact order. Credit is given if one or more complete series
out of three are recalled correctly.

Average adult level

1. Vocabulary. The subject is asked for definitions of words in the
standard list. Credit is given if 20 or more are defined.

2. Codes. The subject is shown a short sentence and a coded version
of the sentence. The code consists of interchanging certain letters
of the alphabet. The subject is asked to figure out the code and
then write the word "hurry" in the code. Next, another similar
item is given. Credit is allowed if 1½ of the two codes are
mastered.

3. Differences between abstract words. The subject is asked to
distinguish between three pairs of associated words, e.g., poverty and
misery. Credit is given if two or more of the distinctions are
correct.

4. Arithmetical reasoning. The subject is asked to solve three
problems; e.g., “If two pencils cost 5 cents, how many pencils can you
buy for 50 cents?" Credit is given for two or more correct
solutions.

5. Proverbs, I. The subject is asked to explain the meaning of three
proverbs, e.g., “A burnt child dreads the fire.” Credit is given for
two or more correct interpretations.

6. Ingenuity. Three problems are given. An example is, a boy is sent
to the river to get exactly three pints of water, and he has only
a 7-pint container and a 4-pint container. How can he measure
the water? Credit is given if two problems are solved.

7. Memory for sentences, V. The subject is asked to recall a sentence
after one hearing. Credit is given if at least one of two sentences is
exactly recalled.

8. Reconciliation of opposites. The subject is asked in what way
pairs of opposing terms are alike, e.g., sick and well. Credit is
given for three correct answers out of six.


ANALYSIS OF THE STANFORD-BINET 


With each of the major tests and measurement methods which are
described in the remainder of the book a critical analysis will be given.
The comments which are made should not be relied upon solely as the
basis for selecting tests for specific purposes. When tests are being
obtained, the relevant research literature should be studied (see Chapter
18). Each critical analysis will provide an opportunity to utilize the
principles and methodology of measurement which were presented in
the first half of the book. In criticizing particular measurement methods, a 
careful distinction should be made between how well an instrument 
measures up to ideals and how well it works in comparison with other 
procedures. In many places we will see that a measure which is crude 
in terms of ideal methods turns out to be much better than the even 
cruder procedures which might otherwise be used. 

What Does the Stanford-Binet Measure. One obvious feature of the 
Stanford-Binet is that it is not a test in the sense in which the term is 
customarily used. Because different persons take different items in terms 
of their age, scores are not directly comparable from age to age. Some 
types of items are used over a wide age range, particularly the memory 
for digits and vocabulary. Other types of items appear in only one or 
several age levels, such as maze tracing, form board, and the interpreta- 
tion of proverbs. 

One of the primary criteria for the selection of items was the correlation
of items with chronological age.* This seems to be a reasonable procedure
in that whatever “intelligence” connotes to people it is thought to
grow as the child grows. It is, however, by no means safe to assume that
anything that increases with age is necessarily an index of intelligence.
Children of twelve years not only have larger vocabularies than
eleven-year-olds, their feet are longer also. No one would consider using the
length of a person’s foot as a measure of intelligence. The constructors of
the scale used not only “age differentiation” to select items, they used
their intuitions as well. As Binet had done, they selected items which
looked as though they related to intelligence, as it is popularly conceived,
and then weeded out items which failed to show the required statistical
characteristics.

* Equally important was the correlation of each item with total score on the test.

The Factor Structure of the Stanford-Binet. Looking more precisely at
what the Stanford-Binet measures, different investigators have found the
test to embody at least several factors. This is so in spite of the effort to
select items which correlate highly with one another. McNemar (29)
reported the results of 14 factor analyses of the Stanford-Binet items. He
found that a first factor at each age level accounts for from 35 to 50 per
cent of the item variance. It was his interpretation that the first factor
is much the same at different age levels. A second factor was found to
explain from 5 to 11 per cent of the item variance, and a third factor was
found to account for from 4 to 7 per cent of the item variance. Taking
these results at face value, they seem to give supporting evidence for the
existence of one major factor in the Stanford-Binet. Several points,
however, need to be examined before this conclusion is reached. If only one
factor were found in the items, this would not mean that “intelligence” is
general but only that the items in the Stanford-Binet had been selected in
such a manner as to represent only the common ground of human
abilities (McNemar had not tried to show otherwise by his analyses).
McNemar performed his analyses in such a way as to make the most of
the common ground in the items. He showed that a general factor can
account for much of the item variance. If one tries, however, to look
for factors in the Stanford-Binet rather than minimize them, it can be
shown that a number of factors are present. A thorough factor analysis
of the items would need to include an extensive array of the kinds of tests
which have led to the discovery of known factors. Also, McNemar's
conclusion that the first factor obtained from the analyses is the same for
different age levels is open to question.

Jones (24) offers a different interpretation of the factor structure of
Stanford-Binet items. He performed separate factor analyses of the items
for children at the ages of 7, 9, 11, and 13 years. Each analysis employed
100 boys and 100 girls. Jones made an effort to find the complete factor
structure of the items. Like McNemar, he used only the
items in the analyses, without the inclusion of reference tests of known
factor composition. He found a number of separable but correlated
factors at each age level. His results are summarized in Figure 10-2.

FIGURE 10-2. Names and relative influence of factors reported at several
age levels. (From L. V. Jones, 24, p. 315.)
[Graph not reproduced. Horizontal axis: Age. Vertical axis scaled from 0 to
0.20. Factors plotted include Reasoning, Reasoning I, Reasoning II, Memory,
Numbers, Visualization, and Spatial.]
* Factors differing in item content from Verbal and Reasoning factors at ages
eleven and thirteen years.

Factor analyses of items usually give less clear-cut results than factor
analyses of whole tests. Jones's factors, therefore, are not clearly
interpretable. Aside from the difficulty of interpreting all of the factors clearly,
the analyses strongly indicate that the factor structure is not the same at
all levels. For example, although what appear to be reasoning factors
were found at ages 7, 9, and 13, nothing of the kind was found in the 11-year
age group. The verbal factors which predominate at the four age levels
are less important at some ages than others.

The answer then to whether the Stanford-Binet embodies only one and
the same factor at different age levels is both yes and no. Much of the
item variance can be accounted for by one factor. The one major factor,
however, is not quite the same at different age levels, which makes the
interpretation of IQ scores somewhat difficult. When an effort is made
to isolate the indicators of more than one factor as Jones did, it is
apparent that the traces of a number of other factors can be found.

The Scoring System. The Stanford-Binet is one of the few careful at- 
tempts at the construction of an age scale. Although numerous other tests 
employ the mental age and the IQ, few of them meet the statistical
requirements. A successful age scale should have a mean IQ of 100 and
equal standard deviations of IQs at different age levels for some defined
population of persons. In order for the standard deviations of IQ scores
to remain constant, it is necessary for the standard deviation of mental
age scores to increase by regular amounts from one age level to the next.
This is an almost impossible thing to accomplish in test construction. The
standard deviation of IQ scores on Form L varies from 20.6 at age 2½
to 12.5 at age 6 years. Special correction tables (29) had to be
constructed to make IQ scores comparable at different age levels.
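
The requirement that mental-age variability grow with age follows directly from the ratio definition of the IQ. Within a single age group the chronological age is a constant, so the standard deviation of IQ equals 100 times the standard deviation of mental age divided by chronological age. The sketch below solves that relation for the mental-age standard deviation; the function name and the IQ standard deviation of 16 are assumptions used only for illustration.

```python
def required_mental_age_sd(chronological_age, iq_sd):
    """SD of mental age (in years) needed at a given age for a fixed IQ SD.

    Within one age group IQ = 100 * MA / CA with CA constant, so
    SD(IQ) = 100 * SD(MA) / CA, and therefore SD(MA) = SD(IQ) * CA / 100.
    """
    if chronological_age <= 0 or iq_sd < 0:
        raise ValueError("age must be positive and the SD non-negative")
    return iq_sd * chronological_age / 100


# With a hypothetical constant IQ standard deviation of 16 points, the
# mental-age SD must rise in direct proportion to chronological age.
for age in (4, 8, 12):
    print(age, round(required_mental_age_sd(age, iq_sd=16), 2))
```

Doubling the chronological age doubles the required mental-age standard deviation, which is the "regular amounts" condition that proved so hard to meet in practice.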

The great difficulty that is met in achieving the required statistical 
properties of an age scale makes it doubtful that an age scale is worth the 
trouble. The IQ concept bears the unwarranted surplus connotations that
the general intelligence tests assess intelligence and that a ratio scale is 
available for its measurement. An equally logical and much simpler 
procedure is to use standard scores instead of IQ units. The mental age 
concept has been clung to by clinicians and school workers because of its 
practical meaning and the ease with which it can be explained to laymen. 

The Testing of Adults. On both subjective and empirical grounds there
is reason to believe that the Stanford-Binet is a better test for children
than adults. One reason for this is that the concept of general intelligence
is more meaningful with children than adults. This point will be discussed
later in the chapter. The Stanford-Binet does not have a high enough
“ceiling” to measure the ability of superior adults. That is, the range of
difficulty at the adult level is not sufficient to tap the abilities of
gifted individuals. Because no persons over eighteen years of age were
included in the standardization sample, the norms for adults are [...].
In addition, there is a logical difficulty in using the IQ with
adults. Because the mental age for an individual stops growing in the
teens, it no longer makes sense to obtain the IQ by dividing mental age
by chronological age. If this were done, the IQs of adults would
constantly drop. What actually occurs in using the test is that for any
individual who is sixteen years or over, the chronological age is considered
to be fifteen years. This results, however, in a number of complicated
corrections that must be made to obtain the IQ (47). Even worse than
the statistical difficulties in assigning IQs to adults, the IQ is a
meaningless measure with adults. For example, if a twenty-year-old person has
an IQ of 150, this does not mean that he has the intelligence of an average
thirty-year-old.
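
The logical difficulty with adult IQs can be seen numerically. The sketch below contrasts the plain ratio with a version that holds the divisor at fifteen years for anyone sixteen or older, as the text describes. The function and constant names are hypothetical, and the sketch deliberately omits the further corrections that the manual (47) requires, so it should not be read as the actual scoring procedure.

```python
ADULT_CUTOFF_AGE = 16.0   # age at and above which the divisor is fixed
ADULT_DIVISOR_AGE = 15.0  # chronological age used for such persons


def plain_ratio_iq(mental_age, chronological_age):
    """Unmodified ratio IQ: 100 * MA / CA."""
    return 100 * mental_age / chronological_age


def ratio_iq_with_adult_divisor(mental_age, chronological_age):
    """Ratio IQ with the divisor held at fifteen years for adults (simplified)."""
    if chronological_age >= ADULT_CUTOFF_AGE:
        divisor = ADULT_DIVISOR_AGE
    else:
        divisor = chronological_age
    return 100 * mental_age / divisor


# A person whose mental age stays at 15 years throughout adulthood.
for age in (20, 30, 40):
    print(age, plain_ratio_iq(15, age), ratio_iq_with_adult_divisor(15, age))
```

With the plain ratio the same performance yields IQs of 75, 50, and 37.5 at ages twenty, thirty, and forty, while the fixed divisor holds the IQ at 100.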


Reliability. Very careful research was undertaken to determine the
reliability of the Stanford-Binet IQ at different age levels (29). An
equivalent-form reliability estimate was made separately for each age.
Correlations were found between the scores obtained on Form L and
Form M administered to the same subjects within one week's time. In
general, the findings show that the Stanford-Binet is a highly reliable
scale, with most of the reliability coefficients equal to or greater than .90.
Scores tend to be more reliable for persons in their teens than for
young children. The studies also show that low scores are somewhat
more reliable than high scores in each age range. In other words, a bit
more faith can be placed in the precision of a very low score than in a
very high score.

Predictive Efficiency of the Stanford-Binet. The test has shown itself
to be a good predictor of different criteria, particularly of school grades.
In general, the findings have been that Stanford-Binet IQs correlate in
the neighborhood of .70 with grammar school grades, .60 with high
school grades, and .50 with college grades (3, 42). (The decline in
validity is probably due to the progressively decreasing dispersion of
intellectual ability.) The following correlations were found between Form L IQ
and high school achievement test scores (10). The number of students
ranges from 78 to 200.

Reading comprehension     .73     Spelling     .46
Reading speed             .43     History      .59
English usage             .59     Geometry     .48
Literature acquaintance   .60     Biology      .54

Clinical Utility. Presumably, a test like the Stanford-Binet should be
judged, in the long run, by the [...]
criteria. Many of the uses, however, to which this and other general
intelligence tests are put are so subtle as to make direct empirical validation
difficult. For example, the test might be used to decide what type of 
psychotherapy should be used with a disturbed child. Because of the 
difficulty in measuring therapeutic success and the difficulty in deciding 
the importance of “intelligence” for the outcome, it is hard to determine 
how well the test works in the situation. In many situations only the 
clinical impression of how well the test performs a particular job can at 
present be used as an indication of "validity." Judging from the wide 
acceptance of the test in clinical settings, it is apparent that the Stanford- 
Binet is judged to be as valuable or more valuable than any other test
for use with children.
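
The equivalent-form reliability coefficients reported for the Stanford-Binet are correlations between the scores that the same subjects obtain on Form L and on Form M. A minimal sketch of that computation follows. The function name is hypothetical and the scores are invented for illustration; they are not data from the standardization studies.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores on each form must vary")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical IQs for six children tested on both forms within a week.
form_l = [88, 95, 102, 110, 121, 130]
form_m = [90, 93, 105, 108, 118, 133]

print(round(pearson_r(form_l, form_m), 2))
```

A coefficient near 1.00 means that the two forms rank the children in nearly the same order; the text reports that most of the obtained coefficients were .90 or higher.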


THE WECHSLER INTELLIGENCE SCALES 


David Wechsler, working at the Bellevue Psychiatric Hospital in New 
York, developed an individual test which was intended to differ from the 
Stanford-Binet in the following respects: 

1. The test items were to be more appropriate for adults, and
representative norms for adults were to be obtained.

2. Age levels were to be discarded in favor of a number of subtests
which all subjects would take.

3. Separate sets of verbal and performance tests were to be
constructed, allowing for both a verbal and performance IQ.

4. The IQ was to be determined by a transformation of standard scores
rather than through the use of mental age scores.

5. The test would be constructed to fit clinical conceptions of
"intelligence" rather than in terms of "age differentiation."

The Wechsler-Bellevue. The first test in Wechsler's series was presented
in 1939 (see 52). Like the Stanford-Binet, the logic for the test is mainly
based on the somewhat misleading concept of "general intelligence" and
on rational rather than empirical validity. Like his predecessors in the
measurement of general intelligence, Wechsler begins with a definition:
"Intelligence is the aggregate or global capacity of the individual to act
purposefully, to think rationally and to deal effectively with his
environment" (52, p. 3). Wechsler (52), as late as 1944, held to the belief that
one common factor underlies intellectual functions and cited Spearman’s
work as support for the belief. Very few persons would share that belief
today.

The subtests for the Wechsler-Bellevue were selected by combing the
test literature for tests that had worked well in practice. After some
preliminary research, 12 of the tests were selected for special study and
administered to over a thousand subjects. One of the tests, Cube Analysis
(see 52), proved to be difficult to explain to subjects and [...]
differences were found. The other 11 tests were found to have desirable [...]
of difficulty and practicality of administration. Brief descriptions of the
11 subtests in the Wechsler-Bellevue are as follows:

Verbal Scale

1. General Information. The subject is asked 25 questions concerning
a wide variety of facts. The questions are not intended to tap
academic training or specialized branches of knowledge. They
are meant to cover the kinds of information that any alert
individual can learn from his cultural contacts.

2. General Comprehension. The test contains 10 items concerning
why certain social rules are necessary and how everyday
problems are solved.

3. Arithmetical Reasoning. Ten problems of the kind that would
be typically encountered in elementary school arithmetic are
given. Both speed and correctness of response are scored.

4. Digit Span. This is the familiar memory for digits which also
appears at different levels of the Stanford-Binet. From three to
nine digits are read to the subject, and he is asked to repeat them
in their exact order. In the second part of the test, the subject is
asked to repeat the digit series backwards.

5. Similarities. The subject is asked to tell what is similar about
12 pairs of terms. This subtest is also very similar to material
found on the Stanford-Binet.

6. Vocabulary. The subject is asked the meanings of 42 words.
Although in the original plan of the Wechsler-Bellevue the
vocabulary test was included only as an alternate, it has worked so well
in practice that it is usually given as a regular part of the verbal
scale.

Performance Scale

7. Picture Completion. The subject is shown 15 incomplete pictures
and asked to describe the missing part in each. This is also very
much like material found on the Stanford-Binet.

8. Picture Arrangement. The subject is handed a set of pictures and
asked to arrange them in an order that tells a story. Six sets of
pictures are given. Both speed and accuracy are scored.

9. Object Assembly. The subject is asked to put together three
jigsaw puzzles. Each puzzle pictures some part of the human body.
Both speed and accuracy are scored.

10. Block Design. The subject is shown a set of small blocks.
Surfaces of the blocks are painted white, red, and red-and-white.
The subject is presented with a picture of a design and asked to
reproduce it with the blocks. Seven designs are given in turn.
Both speed and accuracy are scored.

11. Digit Symbol. This is an adaptation of the familiar coding test.
The subject is given a sheet of paper on which nine symbols are
paired with nine numbers. Farther down on the page a jumbled
list of the numbers is given and the subject is asked to write in
the matching symbols.

Administration and Scoring of the Wechsler-Bellevue. Like most indi- 
vidually administered general intelligence tests, only an expert examiner 
should be trusted to give the Wechsler-Bellevue. The test is somewhat 
easier to administer than the Stanford-Binet. The subject is given each 
of the subtests in turn. The individual’s raw score for a subtest indicates 
how many items he answered correctly and/or how quickly the task was 
completed. Tables are available (52) for converting raw scores on each 
subtest to transformed standard scores. The transformed standard scores 
have a standard deviation of 3 and a mean of 10. The means and standard 
deviations were obtained from a normative sample of 1,751 persons 
between the ages of seven and seventy years.

Three separate IQs can be found from the tests. Scores on the six
verbal tests are added together to form one IQ score. Similarly, the
five performance tests can be added to give another IQ. Then, the two
subscales can be combined to form one over-all IQ. The IQs are simply
transformed standard scores. In this case, the transformation at each
age level was done in such a way as to have the usual mean of 100 and a
standard deviation close to that of the Stanford-Binet.* Because it was
found that raw scores tend to decline gradually in going from people of
twenty-five to seventy years of age, the IQ is determined about the
mean raw score for each age group.
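
What it means to determine the IQ "about the mean raw score for each
age group" can be illustrated with a short sketch. The age bands, the
norm values, and the standard deviation of 15 used here are
assumptions made for the illustration; they are not Wechsler's
published norms.

```python
# Hypothetical (mean, standard deviation) of summed subtest scores per age band.
AGE_NORMS = {
    "25-29": (100.0, 20.0),
    "60-64": (80.0, 20.0),
}


def deviation_iq(total_score, age_band, iq_mean=100.0, iq_sd=15.0):
    """IQ computed relative to the person's own age group."""
    group_mean, group_sd = AGE_NORMS[age_band]
    z = (total_score - group_mean) / group_sd
    return iq_mean + iq_sd * z


# The same total score yields different IQs at different ages.
print(deviation_iq(90, "25-29"))  # 92.5
print(deviation_iq(90, "60-64"))  # 107.5
```

Because the older group has the lower mean in this made-up example,
the same performance earns a higher IQ for the older person, which is
the sense in which the method does not penalize age.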

Analysis of the Wechsler. The Norms. All of the subjects in the stand- 
ardization sample were obtained from the State of New York, the majority 
coming from New York City. The 1,751 persons used in the normative 
sample were selected from 3,500 cases in such a way as to represent the 
occupational distribution of white citizens in the United States. There
were considerably more men than women in the sample. Although the
effort to obtain a representative sample was more systematic than that for 
many other available tests, the derived norms are likely to be somewhat 
misleading when applied to the country as a whole. 

It is an open question whether the IQ should be determined about the
mean of each age group as is done on the Wechsler-Bellevue or about an
over-all mean. The IQ as determined from the Wechsler-Bellevue has the
advantage of not “penalizing” persons of middle age and over. (The
relationship between age and intelligence will be discussed in the
latter part of this chapter.) In the absence of definite evidence
showing that “general ability” as well as test scores decline, the
safest procedure is to center scores about the mean at each age level.

* Because the standard deviation of Wechsler-Bellevue IQs is slightly
smaller than that of the Stanford-Binet, IQs obtained from the two
tests are not directly comparable.

The Test Content. It is obvious from an inspection of the subtests that
much of the material is similar to other tests and particularly to the
Stanford-Binet. Studies of unselected adolescent and adult groups have
generally found correlations of .80 (36, 51, 52) and higher between the
Wechsler-Bellevue and the Stanford-Binet. The similarity, however, in
content of the two tests does not detract from the purpose of the
Wechsler-Bellevue. It was Wechsler’s intention to select a more proper
range of difficulty in adult items, not necessarily to invent entirely
new kinds of test materials.

Davis (14) factor analyzed the 11 Wechsler-Bellevue subtests along
with reference tests for some of the known factors of human ability. He
used a sample of 202 eighth-grade pupils in the analysis. Davis found
evidence of a number of factors in the subtests: verbal comprehension,
visualization, numerical facility, mechanical knowledge, general
reasoning, perceptual speed, and eduction of conceptual relations. The
conclusion to be reached from this and other factor-analytic studies
(1, 5) is that the Wechsler-Bellevue is factorially complex, apparently
more so than the [...]

It is an open question whether Wechsler did right in prominently
including materials of the performance type on the Wechsler-Bellevue.
This is an argument that could be solved straightforwardly if the worth
of the test were entirely determined by how well it predicts [...]
criteria. In general, performance tests tend to be somewhat less
reliable than verbal tests, correlate less among themselves, and are
[...] poorer predictors of available criteria (mainly, school grades).
[...] of prediction problems is studied, it may be found that there are
assessments which can be better predicted by the performance tests than
by the verbal tests.

Reliability. The evidence is that the total test IQ has a reliability
comparable to that of the Stanford-Binet, in general being about .90.
As is always the case, the individual subtests are less reliable. In a
group of normal adults, retest reliability coefficients for the
subtests ranged from .62 to .88 (15). Thus, although the total test IQ
is highly reliable, marked fluctuations are to be expected in some of
the [...]

Profile Analysis. It has been a common clinical practice to use the
subtests of the Wechsler-Bellevue in profile analysis to detect
clinical disorders. The underlying assumption is that different kinds
of pathology manifest themselves in different score patterns. Wechsler
hypothesized (52, p. 150), for example, that the following variations
in subtest scores characterize the schizophrenic patient:

   Object Assembly much below Block Design

   Very low score on Similarities with high Vocabulary and Information

   Sum of scores on Picture Arrangement plus Comprehension less than
   sum of scores on Information plus Block Design

Other profile patterns have been hypothesized for brain-damage cases
and a number of other clinical types.

Profile analysis is undertaken with the Wechsler-Bellevue in the
situation in which it is least warranted: insufficient reliability for
the individual subtests and moderate to high correlations among the
subtests. Summaries of the relevant research (36, 51) show that profile analysis efforts
with the subtests have met with essentially negative results. Studies of 
the relationship between test profiles and other human attributes should 
be based on the results of well-standardized multifactor batteries, not on 
the subtests within general intelligence tests. 
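
Why low subtest reliability combined with substantial correlation
between subtests is so damaging to profile analysis can be seen from
the standard formula for the reliability of a difference between two
scores. The sketch below assumes that both scores are expressed on the
same standard-score scale (equal variances); the numerical values are
illustrative and are not taken from the studies cited.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference X - Y between two standard scores.

    r_xx, r_yy: reliabilities of the two subtests
    r_xy:       correlation between the two subtests
    Assumes equal variances for X and Y.
    """
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Two subtests with reliabilities of .70 that correlate .50 with each other:
# the difference between them has a reliability of only about .40.
print(round(difference_reliability(0.70, 0.70, 0.50), 2))
```

The more highly two subtests correlate, the less dependable is any
difference observed between them, even when each subtest alone is
moderately reliable.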

Predictive Efficiency. Basing his test as he did on a rational approach
to the measurement of intelligence, Wechsler (52) reports only a few
specific correlations with performance criteria, and these are based on
small numbers of cases. Other studies report correlations in the .40s
and .50s between Full Scale IQ and freshman college grades (3, 42). The
separate Verbal Scale IQ had higher correlations than Full Scale IQ
with grades; the Performance Scale proved to be a much poorer predictor
of freshman grades.


THE WECHSLER ADULT INTELLIGENCE SCALE (WAIS) 


To meet the criticisms which were made of the Wechsler-Bellevue, a
revision of the test, WAIS, was published in 1955 (54). All the original
subtests were retained but many new items were substituted for less
satisfactory old ones. The instructions for administering and scoring
the test have been revised. An effort was made to obtain a wider range
of item difficulty in each subtest, which should provide more adequate
testing of persons at both extremes of the ability range. The norms for
the WAIS are much more representative of the United States population
than were those for the Wechsler-Bellevue. A carefully balanced sample
of 1,700 persons was tested in 24 communities spread across the country.
The sample resembles closely the United States population in terms of
urban and rural, men and women, geographic region, and occupation.
The test manual (54) is more frank, clear, and complete than was that
for the Wechsler-Bellevue. Although it will be some time before the
ultimate worth of the test can be determined, the indication is that
the WAIS will be a better test than the Wechsler-Bellevue.


THE WECHSLER INTELLIGENCE SCALE FOR CHILDREN (WISC)


Although the Wechsler-Bellevue provides norms for children down
to the age of ten years, it is primarily an adult test. A separate test, the
WISC (53), was constructed specifically for children, aged five to fifteen 
years. The general organization of the test is much like the parent form: 


Verbal Scale                    Performance Scale

1. General Information          6. Picture Completion
2. General Comprehension        7. Picture Arrangement
3. Arithmetic                   8. Block Design
4. Similarities                 9. Object Assembly
5. Vocabulary                  10. Coding (or Mazes)
Alternate: Digit Span


The subtests are similar in most cases to those in the Wechsler-Bellevue.
The tests, however, are easier and the materials are more directly
related to the child's world. Digit Span can be used as an alternate to
one of the five verbal tests. The examiner has the choice of using
either Coding or Mazes as a tenth test.

Deviation IQs are obtained for each separate age level. Tables (53) are
supplied for the conversion of raw scores to IQs for every four-month
interval from 5 to 15 years. The conversion formula was determined in
such a way that the mean IQ for each age level in the standardization
sample is 100 and the standard deviation is 15. The manual (53) gives
the percentage of children that fall at different IQ levels, with a
verbal description of what particular scores mean. (See Table 10-2.)

TABLE 10-2. CLASSIFICATION OF IQs ON THE WISC
(Adapted from Wechsler, 53, p. 16)

Description         IQ ranges         Per cent of children

Very superior       130 and above      2.2
Superior            120-129            6.7
Bright              110-119           16.1
Average             90-109            50.0
Dull normal         80-89             16.1
Borderline          70-79              6.7
Mental defective    69 and below
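
The percentages in Table 10-2 can be compared with what a normal curve
having a mean of 100 and a standard deviation of 15 would give. The
sketch below is only a rough check under two assumptions that are not
stated in the source: that IQs are normally distributed, and that the
class boundaries fall exactly at 70, 80, 90, 110, 120, and 130.

```python
from math import erf, sqrt


def normal_cdf(x, mean=100.0, sd=15.0):
    """Proportion of a normal distribution falling below x."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


def percent_in_band(low, high):
    """Per cent of cases between low and high (None means unbounded)."""
    lower = 0.0 if low is None else normal_cdf(low)
    upper = 1.0 if high is None else normal_cdf(high)
    return 100.0 * (upper - lower)


BANDS = [
    ("Very superior", 130, None),
    ("Superior", 120, 130),
    ("Bright", 110, 120),
    ("Average", 90, 110),
    ("Dull normal", 80, 90),
    ("Borderline", 70, 80),
    ("Mental defective", None, 70),
]

for label, low, high in BANDS:
    print(f"{label:18s}{percent_in_band(low, high):5.1f}")
```

Under these assumptions the computed values come out close to, though
not identical with, the tabled percentages (roughly 2.3, 6.8, 16.1,
and 49.5 for the top four classes), and the curve is symmetrical about
the Average class.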


The WISC is a much more carefully standardized test than the
Wechsler-Bellevue. Particular care went into the gathering of norms. One
hundred boys and 100 girls were tested at each age level, giving a total
of 2,200 children in the standardization sample. A strenuous effort was 
made to choose a representative cross section of white children in the 
United States. The sample was drawn from 85 communities in 11 states. 
The distribution of subjects closely resembled the country at large in 
terms of urban-rural proportion, geographical area, and parental occupa- 
tion. The WISC standardization sample is as representative as that used 
in any current general intelligence test. 

Analysis of the WISC. Because no alternate form of the test is available, 
it was necessary to use split-half reliability estimates (53). The reliability 
was carefully studied at age levels 7½, 10½, and 13½ years. The Full
Scale reliabilities at these age levels were found to be .92, .95, and .94
respectively. For the Verbal Scale, the reliabilities were found to be .88,
.96, and .96 respectively. As is usually the case, the reliabilities were
slightly lower for the Performance Scale: .86, .89, and .90. In general
these coefficients represent good reliabilities for the three IQ scores. Here 
again the lesson is learned that high reliability of composite scores does 
not mean that subtest reliabilities are all high. Although most of the
subtest reliabilities are in the .60s, .70s, and .80s, some go as low as
.50. This should be a warning to anyone who tries to consider profile
differences among the subtest scores. It is safe enough to compare
verbal IQ with performance IQ, and this should have diagnostic meaning
for the child who suffers a language handicap, but no finer inspection
of the differences among subtests can be trusted.
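
A split-half reliability estimate of the kind mentioned above is
commonly obtained by correlating scores on two halves of the test and
then stepping the correlation up with the Spearman-Brown formula. The
sketch below assumes an odd-even split of item scores; the source does
not say how the halves were formed for the WISC, and the data layout
and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, k=2.0):
    """Estimated reliability of a test k times as long as one with reliability r."""
    return k * r / (1.0 + (k - 1.0) * r)


def split_half_reliability(item_scores):
    """item_scores: one list of item scores per person (odd-even split assumed)."""
    odd_half = [sum(person[0::2]) for person in item_scores]
    even_half = [sum(person[1::2]) for person in item_scores]
    return spearman_brown(pearson_r(odd_half, even_half))
```

The same `spearman_brown` function, with `k` set to the ratio of new
length to old length, also shows why a longer test of comparable items
tends to be more reliable than a shorter one, and why a composite can
be quite reliable even though its separate subtests are not.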


Several studies found correlations between the Stanford-Binet and the
WISC ranging from the .60s to the .90s. Correlations with the
Stanford-Binet are lower with the WISC Performance Scale than with the
Verbal Scale. The test is too new to know what predictive validity it
will have for different performance criteria. Few criteria are
available for children except school grades. Considering the appearance
of the test materials, the reliability, and the correlations with other
tests, it is to be expected that the test will show about the same
predictive efficiency for school grades as that afforded by the
Stanford-Binet. Although the test is currently receiving generally good
response from clinical settings, it takes a considerable amount of
experience with a new instrument to know whether or not it has clinical
value and practicality of use.


GROUP TESTS OF GENERAL INTELLIGENCE 


Like the individual tests of general intelligence, the group tests are usu- 
ally composed of verbal comprehension, numerical computation, and vari- 
ous mixtures of the reasoning factors. Although the tests differ from one 
another in appearance and sometimes in their factor composition, they
tend to correlate highly with one another. At the teen-age and adult
levels the group tests correlate highly with the individual tests, such
as the Wechsler-Bellevue. We see the interesting result that people who
started off in seemingly different directions to compose intelligence
tests ended up with rather similar measures. Where the group tests
differ from one another is in their practical advantages. Some are
longer and thus more reliable than others. Some have obtained norms in
a careful and representative manner; others have only scant or
misleading information on norms. Some have either higher or lower
"ceilings," making them more useful with one or the other extreme of
ability. Considerable research has been done with some of the tests,
and only "face validity" can be claimed for others.

The Binet and Wechsler scales dominate the field of individual tests of
general intelligence. Among group measures neither one nor several
tests have dominated the field. The tests, therefore, which will be
discussed here are only examples of the measures available.

Group Tests for Young Children. The youngest ages at which it has
proved feasible to use group tests are the five- and six-year levels.
Only small groups of approximately a dozen children can be tested in
this way, and even then the examiner must exercise considerable skill
to obtain the necessary cooperation and attention. Tests at this age
level cannot employ written language and the child cannot be expected
to write his own responses. Test instructions must be given orally and
supported by illustrations and gestures.

One of the most widely used tests for young children is the
Pintner-Cunningham Primary Test (34), which has been in use for over
twenty years. The test is available in three equivalent forms, A, B,
and C. Each form is composed of seven subtests which are added together
to obtain one score. (Illustrative items are shown in Fig. 10-3.)
Equivalent form reliabilities are found to be generally high for groups
of kindergarten and first-grade children (34), ranging from .83 to .89.
Correlations between the Pintner-Cunningham [...] are usually about .80
(34). In a [...] Pintner-Cunningham correla[...] scores on a reading
test. Other [...] young children are the [...] and the Otis Alpha (31).

[...] Level. As children progress [...] and more written material can
[...] several grades it is still necessary to rely [...] and pictorial
test materials. One of the oldest tests available for the elementary
school range is the National Intelligence Test (22). The test was
developed soon after the First World War in the wake of the successful
use of group tests with Armed Forces personnel.

Figure 10-3. Illustrative items from Form A of the Pintner-Cunningham
Primary Test. (Reproduced by permission of World Book Company.)

   Test 2. Mark the prettiest girl
   Test 3. Mark the two things that belong together
   Test 7. Look at how each picture is drawn; make another one like
           it in the dots

The test comes in two similar parts, Scale A and Scale B. Each scale
has four verbal and one numerical subtest. A total score is obtained by
adding the subtests together. Equivalent forms have been constructed
for each scale. The test relies heavily on verbal comprehension,
utilizing such subtests as verbal analogies, opposites, and sentence
completion. The test was used for many years as a standard instrument;
but because it has not been revised in a long time, it is now generally
being replaced by more recently developed tests. Other tests which are
used with elementary school children are the Kuhlmann-Anderson
Intelligence Tests (27), the Otis Beta (32), Lorge-Thorndike
Intelligence Tests (Houghton Mifflin Company), Henmon-Nelson Tests of
Mental Ability (Houghton Mifflin Company), and the California Test of
Mental Maturity (45).

Group Tests for High School Students and Average Adults. Intelligence
tests for teen-agers and above tend to rely heavily on verbal
comprehension and general information, with only small inclusion of 
other factors. Some of the tests are designed primarily to predict future 
achievement in school. These are usually weighted heavily with ques- 
tions on general information and look much like achievement examina- 
tions. Other tests are designed primarily to test aptitude—not what the 
individual knows but his ability to learn. The two kinds of tests, however, 
tend to correlate highly. As was previously mentioned, tests of intelli- 
gence, or aptitude, tend to rely heavily on the individual’s knowledge of 
the culture about him; thus, they are not very different from tests of 
acquired knowledge. There are exceptions: persons who score consid- 
erably lower on tests of information than on intelligence tests; and for 
these persons, the difference in scores on the two kinds of tests is of
diagnostic importance.

One of the oldest group tests for teen-agers and average adults is the 
Army Alpha, which did so much to popularize group tests after the First
World War. Civilian adaptations and revisions of the test (55) are still in
use. The Pintner, Otis, and Kuhlmann-Anderson series, mentioned in
respect to the testing of younger children, have forms with high enough
“ceiling” for the testing of average adults.

One of the leading tests for this age level is the Terman-McNemar Test 
of Mental Ability (48). It has seven subtests, all of which are predomi- 
nantly concerned with verbal comprehension. (See Figure 10-4 for illus- 
trative items.) Norms were carefully determined on a sample of persons 
from 200 communities in 37 states. Conversion tables are used to express 
scores in terms of percentiles, mental ages, and deviation IQs. 

[...] at this age level is the Army General Classification Test [...]
Second World War. [...] The test includes an equal proportion of [...]
verbal, numerical, and spatial items. [...] at least several factors in
an intelligence test is a departure from the usual emphasis on verbal
comprehension alone.

[...] Adult. Tests in this [...] level either emphasize aptitude or
general information. As the higher levels of education and ability are
[...] becomes more of [...] is much easier to compose adequate tests
for idiots than [...] and the difficulty usually mounts when going from
[...] classification tests have [...] grades, with correlations in the
.50s [...] uncommon. It has been a par[...] of professional test
builders that they have generally met with only small success in
predicting the more creative side of human behavior as it is manifested
in graduate school and beyond.

Figure 10-4. Sample items from the Terman-McNemar Test of Mental
Ability. (Reproduced by permission of World Book Company.)

   Test 1. Information
   Mark the word that makes the sentence TRUE.
   Our first President was:
   1. Adams  2. Washington  3. Lincoln  4. Jefferson  5. Monroe

   Test 2. Synonyms
   Mark the word which has the same or most nearly the same meaning as
   the first word.
   Correct  1. Neat  2. Fair  3. Right  4. Poor  5. Good

   Test 3. Logical Selection
   Mark the word which tells what the thing ALWAYS has or ALWAYS
   involves.
   A cat always has:
   1. Kittens  2. Spots  3. Milk  4. Mouse  5. Hair

   Test 4. Classification
   In each line below, four of the words belong together. Pick out the
   ONE WORD which does not belong with the others.
   1. Dog  2. Cat  3. Horse  4. Chicken  5. Cow

   Test 5. Analogies
   Hat is to head as shoe is to:
   1. Arm  2. Leg  3. Foot  4. Fit  5. Glove

   Test 6. Opposites
   Choose the word which is OPPOSITE, or most nearly opposite, in
   meaning to the beginning word of each line.
   North  1. Hot  2. East  3. West  4. Down  5. South

   Test 7. Best Answer
   Read the statement and mark the answer which you think is best.
   We should not put a burning match in the waste basket because:
   1. Matches cost money.      2. We might need a match later.
   3. It might go out.         4. It might start a fire.

Among the most widely used tests at this level are those which are
designed specifically for the selection of college freshmen. One of the
best known of these is the American Council on Education Psychological
Examination for College Freshmen (ACE) (56, 57), which was used for
many years but has now been discontinued. A new form of the test was
constructed each year for use with college applicants. Three of the six
subtests combine to form a linguistic score (L), and the other three
form a quantitative score (Q). The L subtests consist of Verbal
Analogies, Completion, and Same-Opposite. The Q tests consist of
Arithmetic Reasoning, Number Series, and Figure Analogies. Scores on
parts L and Q were usually added together, and the composite used in
the selection of students. The ACE has high reliability, with usual
retest coefficients around .90 (58). Other tests which are similar in
purpose and solidity of construction to the ACE are the College Board
Tests (60), the College Qualification Tests (The Psychological
Corporation), and the Ohio State University Psychological Test, Form 21
(Ohio College Association).

The Miller Analogies Test (30) is an interesting effort to predict
success in graduate school and in high-level professional work. The
test consists of 100 verbal analogies ranging widely in difficulty. The
analogies are cleverly constructed in such a way as to measure both
verbal comprehension and reasoning ability. The major advantage of the
test is that it has a high enough “ceiling” to distinguish among
persons who would make much the same score on the usual intelligence
test. Split-half reliabilities for the test are generally found to be
over .90. The Miller Analogies Test has been used frequently to predict
grades in graduate school. The predictive efficiency of the test varies
widely in terms of the school and the academic specialty being studied.
Validity coefficients tend to be about .50 on the average. Another test
which is both well constructed and capable of differentiating at high
levels of ability is the CAVD (49).

Achievement test scores in college and course grades are generally as
valid as the intelligence tests for predicting graduate school and
professional performance (some of the achievement tests which can be
used for this purpose will be mentioned in Chapter 12). A combination
of the three kinds of measures by multiple regression usually makes for
more efficient prediction.
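
The gain from combining predictors can be illustrated with the
standard formula for the multiple correlation of a criterion with two
predictors. The sketch below is limited to two predictors for
simplicity, whereas the text speaks of three kinds of measures, and
the coefficients used are illustrative rather than taken from any
study cited here.

```python
from math import sqrt


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: validity coefficients of predictors 1 and 2
    r_12:       correlation between the two predictors (must not be +1 or -1)
    """
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)
    return sqrt(r_squared)


# Two predictors that each correlate .50 with grades and .50 with each other:
# the optimally weighted combination correlates about .58 with grades.
print(round(multiple_r(0.50, 0.50, 0.50), 2))
```

The improvement is largest when the predictors are individually valid
but correlate little with one another; predictors that largely
duplicate each other add little to the prediction.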

GENERAL INTELLIGENCE TESTS FOR INFANTS
AND PRESCHOOL CHILDREN


A separate section has been reserved here for the discussion of infant
and preschool tests because of the special problems that are involved.
One of the pioneers in [...] infants is Arnold Gesell. For over 20
years he and his colleagues have performed longitudinal studies of
child development. [...] were prepared. They are [...]

1. Motor behavior. How well the [...] stand, walk, and [...]

2. Adaptive behavior. How well the child can solve the problems of his
   small world: obtain objects, remove obstacles, solve puzzles, and
   react to stimuli.

3. Language behavior. How well the child can communicate, using the
   word in its broadest meaning, including the use of gestures and
   primitive words, to the later development of real language.

4. Personal-social behavior. How well the child learns habits of
   personal care such as toilet-training, dressing, and feeding
   himself. At a later age consideration is given to how the child
   manages himself in social situations and in play activity.

During the first year of life, when the Gesell scales would supposedly 
have their unique value, most of the observations have to be made about 
motor behavior. The 4-week old infant cannot, of course, talk or follow 
oral instructions of any kind. The most that can be done at the infant 
stage is to watch the child’s spontaneous movements and note how he 
reacts to various stimuli. At 1.4 months the average child can coordinate 
his eyes on an object held before him. At 3 months he will make
reaching movements for an object. At 5.5 months he will react
differently to strangers than to his parents. (See Figure 10-5 for
illustrative test materials.) Other tests for infants are the Cattell
Infant Intelligence Scale (11), the California First-Year Mental Scale
(7), and the Northwestern Infant Intelligence Tests (18).

Figure 10-5. Test materials used with the Gesell Developmental
Schedules. (Courtesy of The Psychological Corporation.)

Analysis of Infant Tests. Tests for infants are difficult to
standardize, administer, and score. They are, of course, all individual
measures. Infant tests are less reliable than tests for older children.
The reliability is considerably lower during the first six months than
afterward. Several studies of different tests have found reliabilities
around .65 for testing during the first six months (6, 11). After 6
months the reliabilities move up to respectable figures in the .80 to
.90 range (6, 11). Except for the first weeks and months, the infant
tests do measure something consistently.

The question is, what do they measure? A real difficulty in validating
infant scales is that there are almost no criteria available until the
child enters school. A customary procedure has been to correlate infant
tests with scores made several years later on more established
intelligence tests like the Stanford-Binet. Studies (8) have shown that
infant tests given at the age of one year or less correlate about zero
with intelligence tests given 5, 10, and 15 years later to the same
persons. It is obvious that infant tests do not measure intelligence as
it is customarily measured in older children. The key to this dilemma
seems to be that the infant scales primarily measure motor and sensory
ability, and the research on reliability shows that there is some
consistency in the development of these attributes. It is quite likely
that the infant tests would predict motor and sensory skills later in
life, but interestingly enough, almost nothing has been done to test
this hypothesis.


Preschool Tests. Between the ages of 2 and 5 the developing
intellectual processes become accessible to psychological tests. After
the child develops speech, can manipulate objects, and becomes
acquainted with the world about him, he can be tested with some of the
materials that are customarily used in intelligence tests. However,
many of the difficulties in test standardization and administration
still remain. The test materials must be largely pictorial or consist
of performance problems.

The difficulty in testing infants is that they usually do little one
way or the other to indicate the status of their intellectual
abilities. Children between the ages of 2 and 5 do too much. They are
so active and distractable that it is difficult to carry on any formal
testing procedures. Many children in this age group are shy with
strangers and will give little if any cooperation to the examiner. They
are often not highly motivated to impress the examiner or themselves
with how well they can perform. Consequently, the test must be posed as
an interesting game to the child, and much depends on the examiner’s
skill.

One of the most prominent tests for young children is the Minnesota
Preschool Scale (20, 21). There are two equivalent forms, each with
26 items. Some of the items are as follows (see Figure 10-6 for an
illustration of some of the testing materials):

 1. Pointing to parts of the body on a doll
 2. Telling what a picture is about
 3. Naming colors
 4. Digit span
 5. Naming objects from memory
 6. Vocabulary
 7. Copying simple geometrical designs
 8. Block building
 9. Jigsaw puzzle
10. Indicating missing part in pictures

Many of the items are similar to those at the lower age levels of the
Stanford-Binet. The instrument is largely a power test with no emphasis
on speed, and the items are little concerned with motor skills. Tests
at this age level which depend on speed and motor skills are probably
poorer measures.

Figure 10-6. Test materials used in the Minnesota Preschool Scale.
(Reproduced by permission of Educational Test Bureau, Minneapolis.)

The Minnesota scale was standardized on a group of 900 children
ranging in age from 1½ to 6 years. Equivalent-form reliabilities of the
total scale vary from .80 to .94 (20, 21). There are some reasons to
believe that the Minnesota scale is not an entirely adequate measure
below the age of 3. Although scores for children above 3 tend to
correlate highly with Stanford-Binet scores obtained later, the
correlation for children below the age of 3 is only .21 (21). Also,
clinical experience indicates that some [...] are not sufficiently
interesting to hold the [...] below the age of 3. Two other preschool
tests are the [...] Test for Young Children (50), and the
Merrill-Palmer Scale (44).


PERFORMANCE AND “CULTURE FAIR” TESTS 


[...] have been criticized for their [...] a test like the
Stanford-Binet, [...] the ability of the deaf child. The child must be
able to hear the examiner's instructions and respond to what is said.
The child who has a speech difficulty is also at a disadvantage even if
he knows the answers. [...] language may be classified as [...] have
above average ability when [...] Stanford-Binet [...] language.

Performance Tests, Tests of the performance type were developed to 
circumvent the traditional emphasis on verba] comprehension. In a per- 
formance test the Subject does not necessarily have to read, write, speak, 
or understand spoken language (the instructior 
mime). One of the most widely used te. 
Paterson Performance Scale (35). The s 
10 of these which are considered the best 
Brief descriptions of the 10 items are 
l. Mare and Foal. A picture is 


ns can be given in panto- 
sts of this type is the Pintner- 
cale contains 15 items, but the 
are used in most practical work. 
as follows: 
presented showing a mare and a foal 
ina country setting. Some cut-out pieces are removed from the 
picture. The Subject must place the pieces in their proper places. 
Seguin Form Board. A wooden board is presented with 10 cut- 


out geometrica] forms, The subject is asked to place the forms in 
their proper places, 


3. Five-figure Form B 
more difficult bec. 
eral parts, 


1 


oard. Similar to 


the Seguin Form Board but 
ause each geome 


s g "me . sev- 
trical form is divided into sev 


ard which proves more 
figure Form Board. 

; à orm board also uses geometrical de- 
signs Which are divided i ? Several pieces, It is made difficult 
because of the close si ity between Some parts of the cut-out 
geometrical forms, 


arms, legs, head. 
7. Feature Proſile. 


( e used to assemble the profile 
of a human face (see Figure 10-8). 
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Ficurr 10-7 Manikin Test. (From Pintner-Paterson Performance Scale; courtesy of 


C. II. Stoelting Company.) 
8. Ship Test. A jigsaw puzzle of a ship is put together (see Fig- 


ure 10-9). : De 

9. Healy Picture Completion, I. Cut-out sections are removed from 
a picture of children playing. The subject is asked to place the 
cut-out sections in their proper places. 

10. Knox Cube Test. A test of immediate memory involving four
wooden cubes. The examiner taps the cubes in a sequence and
asks the subject to do likewise.

Figure 10-8. Knox-Kempf Feature Profile Test: Pintner-Paterson Modification.
(Courtesy of C. H. Stoelting Company.)

Figure 10-9. Ship Picture Form Board. (From Pintner-Paterson Performance Scale;
courtesy of C. H. Stoelting Company.)

Credit is given on the items for both speed and accuracy. Many of the
performance tests are so easy for the average person that only through 
the “speeding” of performance is any dispersion of scores obtained. The 
emphasis on speed is probably a weakness of the Pintner-Paterson and 
other similar performance tests. Speed of response is conditioned by age, 
subculture, and personality. Although the point is not settled, it is doubt- 
ful that a heavy emphasis on speed in general intelligence tests leads to 
good measures. 

The Pintner-Paterson is composed almost entirely of jigsaw puzzles in 
one form or another. The test standardization and the obtaining of norms 
are crude in comparison to tests like the Stanford-Binet. The reliability 
of the test is considerably lower than that for most “verbal” tests. Corre- 
lations between the Pintner-Paterson and traditional intelligence tests 
are usually low. Other performance scales are the Arthur Point Scale 
(4) and the performance scales on the Wechsler tests. Both of these per- 
formance scales are generally more satisfactory instruments than the 
Pintner-Paterson. 

Analysis of Performance Tests. The seemingly good idea of constructing
performance measures of general ability has not borne fruit. The tests
tend to be too much alike and more in the nature of incidental games than
measures of important functions. In general the scales have lower
reliability and less validity for the available criteria than do the traditional
"verbal" tests. There is little reason to believe that performance tests are
"better" to use with children in general. The remaining use for these
measures is in testing the individual with a language handicap of some
kind. If there is a marked difference in performance by such persons
between performance tests and "verbal" tests, it is an important diagnostic
clue. Fewer research funds and much less energy have gone into the
construction of performance scales than into “verbal” intelligence tests.
Perhaps a more strenuous research effort would produce more useful
performance tests.

“Culture Fair” Tests. Objections have been made not only to the emphasis
on language in traditional intelligence tests but to the inclusion of
cultural artifacts of all kinds. For example, the Zuñi Indian child or a
child reared in the Tennessee mountains might have difficulty in identifying
a locomotive, interpreting a picture about sea life, or answering questions
about how city people live. The advent of radio, television, and other
arms of the mass media of communication has tended to provide all
people with a common set of cultural symbols and information. However,
the opportunity to absorb cultural information is a matter of degree.
Everyone is to some extent isolated in a subculture of his family, friends,
religion, and geographical environment. Living in some subcultures rather
than others makes a difference in the ability to score well on traditional
intelligence tests. The person in the crossroads of our culture, the urban,
highly “Americanized” individual, is given the most advantage.

In order to have measures which are “culture free” or “culture fair,”
tests have been composed which not only make very little use of language
but are little dependent on the symbols and information of the culture.
The items on such tests are generally composed of geometrical designs,
simple pictures, and abstract symbols. One of the better-known tests of
this kind is the Progressive Matrices Test developed in England by Raven
(37). The test consists of 60 designs, or matrices, each of which has a
cut-out section. The subject is shown six to eight alternative cut-out pieces
and asked to indicate which should be placed in the matrix (see Figure
10-10 for illustrations of the items). Raven has collected extensive
standardization data on the test in England. Rimoldi (38) obtained rather
similar norms on children in Argentina, suggesting that the test is, to an
extent, "culture fair." High correlations have been found between the
Progressive Matrices and "verbal" tests. A factor-analytic study (39)
shows that the test measures some components of the reasoning factors,
most likely emphasizing eduction of relationships, plus elements of the
spatial factors in some items. The Progressive Matrices gives promise of
being an important instrument for testing persons with a language
handicap and for cross-cultural comparisons. Other "culture fair" tests are the
Semantic Test of Intelligence (41), the Culture-free Test of Intelligence
(12), and the Leiter International Performance Scale (28).

The "culture fair" tests have not been studied sufficiently to estimate
their eventual worth. The critical question is whether the type of tests
being used can measure the broad functions which underlie human
abilities. The performance tests have overspecialized in the measurement of
speed, spatial factors, and motor skills. The “culture fair” tests have
apparently been able to tap a broader range of functions, particularly
some of the reasoning abilities. However, it is questionable whether a test
wi ie be comprehension will be ultimately the ps 
availible: citere. Je eo She has are still better predictors of t F 
beyond school is studi 2 at as performance in real-life situation 

ORG sed ucied more fully the culture fair” type of test will 
come into its own, YP 

It would seem that an even more valid test could be developed if some 


way were known for measurin i 
ay a B aptitude for v fone 
distinct from the Status of verbal à ise E kiero 


only way that is presently ay 
through vocabulary tests a 
an individual would appear “intelli 
lectual goals if he could not ey 
acquaintance with the culture, co 


nd measu: 


accomplish intel- 
aining and wider 
anguage and its usage 


Figure 10-10. Two items from the Raven Progressive Matrices Test. (Reproduced by
permission of J. C. Raven.)

THE NATURE OF “GENERAL INTELLIGENCE”


It has been shown that in spite of the separable factors that underlie
tests of human ability there is a common ground among ability functions
that can be reliably and usefully measured. The “verbal” intelligence tests,
both individual and group, generally correlate highly enough with one
another for us to speak of common characteristics. If a marked difference
in test scores is found between two kinds of people with one of the
tests, the difference would most likely be reflected in the others. The wide
use of intelligence tests during the last half-century has shown a number
of things about the underlying process:

1. Intelligence cannot at present be measured in children below the
age of two years and not very well below the age of five or six years.
Earlier in the chapter the difficulties of constructing tests for infants and
preschool children were discussed. Perhaps the next decade will show an
improvement in the early measurement of general ability.

2. The concept of “general intelligence” is more meaningful with
children than adults. Although the evidence on this point is somewhat
conflicting, it seems that abilities are more "general" in children. Comparative
factor analyses of children and adults tend to show that there is more of a
tendency toward one general factor in children. Some of the factors which
are found in adult populations are difficult to find at all in children. This
may be due in part to the fact that different test materials have to be
used with children than with adults. However, the weight of the
argument is that abilities are more “general” in children. The factorial
diversity of abilities in adults is probably due to different life experiences
and different kinds of school and vocational training. It makes more sense
to use general intelligence tests with children than adults.

3. Intelligence as measured by current “verbal” tests is partly due to
heredity and partly due to environment. The most telling arguments for
this position come from the studies of resemblance between family
members in intelligence. Conrad and Jones (13) administered intelligence tests
to over 200 families in rural New England. They found that for children
above the age of five years intelligence of parents and children correlates
.49. Numerous other studies have found correlations very close to .50.
Conrad and Jones also found a correlation of .49 between siblings
(between brothers, between sisters, or between brothers and sisters).
Roberts (40) pointed out that a correlation of .50 between siblings is what
would be expected from multifactor inheritance. It is possible that the
correlations which have been found between family members could be
due to environment rather than heredity. Family members tend to share
the same environment, talk about topics in common, and have similar
kinds of schooling. Environment may explain part of the resemblance,
but studies of twins (see 2) make it apparent that this is not a complete
explanation. Correlations between the intelligence test scores of fraternal
twins (dizygotic) are usually higher than between siblings, usually
ranging from .50 to .70. Correlations between identical twins (monozygotic)
are usually around .90, almost as high as the reliability of the tests!
Fraternal twins can have very different genetic structures, but the genetic
structures of identical twins are exactly alike. This leaves little doubt that
at least a portion, and apparently a sizable portion, of intelligence is due
to inheritance.

4. After the age of about six years, the individual's intelligence tends to
remain stable in respect to his age group. That is, the superior people at
one age level tend to be the superior people at other age levels (see Figure
10-11 for evidence on this point). The relationship is far from perfect,
and isolated individuals may show drastic changes over a period of years.

Figure 10-11. Effect of age at initial testing and test-retest interval on prediction of
IQ from earlier test. (Adapted from Honzik, McFarlane, and ...)
(Axes: correlation between tests, by age at later test.)

5. In test results obtained at one period in time, scores increase
with the age of the persons tested up to a point, and the intelligence
level then decreases throughout adulthood and old age (see
Figure 10-12). This has been erroneously interpreted to mean that each
person’s intelligence grows and then declines. This is not at all what
Figure 10-12 shows. It shows the scores of different persons of different
ages tested at about the same time. There are a number of different ways
of explaining the trend in addition to the one that older people decline in
intelligence. The older people probably had less schooling, were reared
in different times and different environments, and reacted differently to
the tests (particularly to highly "speeded" tests). It will be some time
before the growth and decline of intelligence for individuals is
known. Nevertheless, it is important to note that older people
score lower than their younger contemporaries. Whether
the same findings will be obtained ten or twenty years from now is an
interesting speculation.

Figure 10-12. Age changes in full-scale standard scores on the Wechsler-Bellevue
Intelligence Scale. (Adapted from D. Wechsler, 52, p. 29.)
6. There are definite 
scores on the Vide telligence test scores. Lower 
people living i / ex nn. f low socioeconomic status, 
pn of the Matted bere People living in the Southern or Southwestern 
and Negroes, Til ates, immigrants from southern Europe, and Indians 
large strain on the bpm of differences of this kind places à 
Cal fo ati " " e. 
eased previously in FIN d. undation of intelligence tests, As was dis- 
„Vabter, the traditional measures of intelligence 


are constructed in such ; 
a Way as g ' certai i is 
y as to favor certain groups. The question 1S 


whether or not the 8 
make high scores if afforde eke lower scores would 
^ antages of the wider culture. There 


: ; ; Y would. It we i th- 
em Negro children who migrate to New Douce idee 


longer they are in the ci 
ne city 
sidered is that even thoug ; Another point that should be con- 
tlian others ak lea ee, , SrOUpS score lower on the average 
The whole RES ey high-scoring individuals can be found in each 
q of ethnic and Socioeconomic differences is highly 
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charged with emotion in both professional and lay circles and is a point 
about which much more information needs to be obtained. 

7. Many different kinds of attainment involve intelligence as measured 
by traditional “verbal” instruments. In particular, intelligence is one of 
the major factors in successful school work. Numerous correlations of .50 
and above between particular tests and school grades were cited in this 
chapter. Intelligence test scores differentiate occupational groups (see 
Table 10-3). However, it should be noted that there is considerable over- 


TABLE 10-3. AGCT STANDARD SCORES OF OCCUPATIONAL GROUPS
IN THE SECOND WORLD WAR


(Adapted from N. Stewart, 43)


Percentile
Occupational groups 10 25 50 75 90
Accountant 114 121 129 136 143 
Teacher 110 117 124 132 140 
Lawyer 112 118 155 Ion x 
Bookkeeper, general 108 114 im 199 A 
Chief clerk 107 114 122 131 1l 
Draftsma 99 109 120 127 137 
Postal eh 100 109 119 126 136 
Clerk 97 108 E 125 133 
erk, general 97 7 5 ; 
Rado 8881 97 108 117 125 136 
adio repairman = 115 125 133 
Salesman 94 107 5 25 3 
5 2 133 
Store manager 91 104 9970 1 E 
Toolmaker 92 101 112 123 129 
S OM z 99 110 120 127 
tock clerk 85 * 127 
V 86 99 110 120 27 
bon 109 118 128 
Policeman 86 96 £ 2 
¢ 8 124 
Electrician 83 96 109 En 155 
Meate: 80 94 108 7 2 
oe 5 107 117 126 
Sheet metalwork 82 95 07 : m 
TRE T oe 77 80 103 114 123 
a une operator ae 89 102 114 122 
utomobile mechanic 1o 
2 86 101 113 123 
Bar en general a 83 99 118 123 
Truk 7i 83 98 111 120 
Tuck driver, heav: 3 9 120 
» heavy 79 96 111 2 
LM - "5 93 108 119 
Laborer 65 16 £ 
7 t 0 2 
Barber 66 79 35 109 120 
: z 75 87 103 119 
Miner 67 : ; M 
Far; 4 61 70 86 103 115 
Aree 70 85 100 116 


Lumberjack 60 
lap between most groups. Comparing the extremes in Table 10-3,
10 per cent of lumberjacks score higher than the lower 10 per cent of
accountants. Scores on intelligence tests are predictive of success in
many but not all jobs (see Table 10-4).


TABLE 10-4. MEDIAN VALIDITY COEFFICIENTS OF INTELLIGENCE TESTS
FOR VARIOUS OCCUPATIONAL GROUPS IN THE PREDICTION OF JOB PROFICIENCY

(Adapted from E. E. Ghiselli and C. W. Brown, 17, p. 577)

Occupational           Median validity    Number of
group                  coefficient        validity coefficients

Clerical workers           .35                85
Supervisors                .40                 9
Salesmen                   .33                 4
Sales clerks              -.09                18
Protective service         .25                 6
Skilled workers            .55                 6
Semiskilled workers        .20                45
Unskilled workers          .08                13


The fact that the correlations
between test scores and job success are near zero in some cases does not
necessarily mean that intelligence is not important for the job. The
dispersion of intellectual ability usually is narrowed considerably in most jobs
by the individual's gravitating toward a job at which he can work
comfortably and by the selection procedures that are used in industrial
settings. Also, it is important to note that the variability of intelligence test
scores is higher for lower-level than for higher-level jobs. This indicates
that, as would be expected, intelligence is a more important determiner
of success in high-level occupations. The individual's ability to succeed is
determined by his intelligence and by a host of other things as well:
abilities not measured by intelligence tests, interests, personality traits, and
just plain luck.
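
The point about narrowed dispersion can be illustrated with a small simulation. The sketch below is not drawn from the studies cited; it assumes a hypothetical population in which ability and job success correlate about .50, then keeps only the persons whose ability falls in a narrow band, as self-selection and hiring procedures might do. The function names, the band limits, and the sample size are all illustrative assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=5000, true_r=0.5, low=0.5, high=1.0, seed=0):
    """Return (r in the whole group, r in the narrowed group).

    Ability is in standard-score units. `low` and `high` are the
    hypothetical limits of ability found among people holding one job.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]
    noise_sd = math.sqrt(1 - true_r ** 2)
    success = [true_r * a + rng.gauss(0, noise_sd) for a in ability]

    r_full = pearson_r(ability, success)

    # Keep only persons whose ability lies inside the narrow band.
    kept = [(a, s) for a, s in zip(ability, success) if low <= a <= high]
    r_restricted = pearson_r([a for a, _ in kept], [s for _, s in kept])
    return r_full, r_restricted


if __name__ == "__main__":
    r_full, r_restricted = simulate()
    print(f"whole group: r = {r_full:.2f}")
    print(f"narrowed group: r = {r_restricted:.2f}")
```

Under these assumptions the correlation in the narrowed group comes out much smaller than in the whole group, even though ability matters just as much for every simulated person.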
CHAPTER 11


Special Abilities


There are many human attributes to measure other than the kinds of
abilities which were discussed in the previous two chapters. There we
talked about the components of intelligence or intellectual ability. These
are usually thought of as the “higher processes,” the most prized of
abilities. However, individual differences are as easily found, and sometimes
as important to study, in sensory, motor, mechanical, and artistic abilities.

In order to understand fully an individual's potentialities and liabilities,
much must be learned about him in addition to his intellectual
abilities. Two persons could make the same score on a general intelligence test
and yet be very different in other important ways. Similarly, for all of
the factors which were described in Chapter 9, two persons could have exactly
the same profile of scores and differ importantly in terms of other abilities.
One person might be underweight and frail, the other an excellent
physical specimen. This would make quite a difference if they were both being
considered for jobs as arctic explorers, test pilots, or steeplejacks. One
person might have a flair for mechanical work, whereas the other might
have little ability. This would be important to know if they were both
considering careers as mechanical engineers. There are many, many other
ways in which the two persons could differ importantly: in visual
acuity, depth perception, finger coordination, musical ability, sense of
balance, athletic ability, resistance to disease, and so on.

It would be futile to try to list even a sizable fraction of the special
abilities that are touched on in psychological work. Instead, a number of
abilities will be described which are important concerns in school
placement, vocational counseling, and personnel selection.


SENSORY TESTS 


The first tests in psychology, those of Galton and his followers, were
largely measures of sensory ability: the fineness of sensory
thresholds, reaction time, and sensory acuity. The early investigators
became discouraged with these measures because they failed to predict
school grades, teachers’ ratings of intelligence, and other criteria of
intellectual accomplishment. However, sensory tests have come back in the
last several decades as important measures in their own right. Attention
will be given here to some components of vision and audition, omitting
the interesting individual differences to be found among the remainder of
the “five senses”: touch, smell, and taste. Indeed, the range of individual
differences extends to numerous other senses that are seldom listed. There
are internal senses like thirst, sensitivity to movement of the limbs, and
awareness of specific organic tensions.


VISION 


The popular practice of talking about “good” and “poor” eyesight does
considerable injustice to the complexity of visual functions. There are a
number of separable and only partially related kinds of “good” vision. A
primary distinction must be made between near acuity and far acuity.
Near acuity concerns how well the individual can discern visual forms
within 1 or 2 feet of his eyes. Far acuity concerns how well the individual
can discern visual forms placed 20 or more feet away. A third component
of "good" vision is depth perception, the ability to judge the proximity of
objects to one another. Another component is the ability to distinguish
colors. Although it is commonplace to think of “color blindness" as a
unitary characteristic, there are different kinds of color blindness. Also,
the ability to distinguish colors is partly a matter of degree rather than
an all-or-none attribute.

Wall Charts. The most familiar measure of visual ability is the ordinary
wall chart, which nearly everyone has encountered in applying for a
driver's license or taking a physical examination. The Snellen Chart is
used extensively for this purpose. It consists of rows of letters, each lower
row containing smaller letters than the one above it. The chart is placed
20 feet from the subject. If he can read the row of letters that the average
man can, he is said to have 20/20 vision. If he can read the row of letters
that the average person can read at only 15 feet from the chart, he is said
to have 20/15 vision. Similarly, if he needs to stand 20 feet from the chart
to read the row that the average person can read from 40 feet, he is said
to have 20/40 vision.
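
The notation described above can be written out as a short sketch. The function names are hypothetical, and the single-number ratio is offered only as an illustrative convenience under the assumption that 1.0 stands for average far acuity; it is not part of the chart itself.

```python
def snellen_notation(test_distance_ft, average_distance_ft):
    """Format a Snellen fraction.

    test_distance_ft: distance at which the subject stands from the chart.
    average_distance_ft: distance at which the average person can read
        the smallest row that the subject reads correctly.
    """
    return f"{test_distance_ft}/{average_distance_ft}"


def acuity_ratio(test_distance_ft, average_distance_ft):
    """Express the same fraction as one number (1.0 = average acuity)."""
    return test_distance_ft / average_distance_ft


if __name__ == "__main__":
    # The three cases discussed in the text, all tested at 20 feet.
    for average in (15, 20, 40):
        print(snellen_notation(20, average), round(acuity_ratio(20, average), 2))
```

A ratio above 1.0 corresponds to better than average far acuity (20/15), and a ratio below 1.0 to poorer than average far acuity (20/40).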

Although the Snellen Chart is an adequate device for detecting gross
deficiencies in visual acuity, it has a number of disadvantages. Like all of
the wall charts, it tests only far acuity. A school child could have excellent
far acuity and still have crippling visual defects of other kinds. Some
alphabetical letters are easier to distinguish than others, and this is not
taken into account when using the Snellen Chart. Also, the rows of letters
are easy to remember, and the test can often be “faked” by the person
who has some prior knowledge of the chart. The amount of light on wall
charts should be carefully controlled, but in much practical work this is
given little consideration. When controlled conditions are obtained, the
reliabilities of the Snellen and other wall charts are satisfactorily high.
One study (51) reports a reliability coefficient of .88 for the Snellen
Chart.

Color Vision. One of the oldest tests for color vision is the Holmgren 
Woolens. The subject is given different colors of yarn and asked to sort 
the ones that are alike. It is a crude test which serves only to distinguish 
persons who are very deficient in color vision. A more systematic measure 
can be obtained with the Ishihara color plates.* The plates are composed
of small patches of color. The person who has good color vision can see 
a number on the plate. The color-deficient person either does not see the 
number or sees a different number. More recent color vision tests are the 
Farnsworth Dichotomous Test for Color Blindness (15), the Farnsworth- 
Munsell 100 Hue Test (16), and the Illuminant-Stable Color Vision Test 
(19). In order to keep color vision tests adequately standardized they 
must be used in the same illumination and protected from fading or 
soilage. 

Multiple-component Tests of Vision. In recent years devices have been
constructed which test a number of different aspects of vision. The three
best-known instruments are the Ortho-Rater (Bausch and Lomb), the
Sight-Screener (American Optical Company), and the Telebinocular
(Keystone View Company). Each of these instruments tests near vision,
far vision, depth perception, color discrimination, and control of the eye
muscles. In general, the multiple-component instruments are a
considerable advance over the older, single-component tests.


AUDITION 


The sense of hearing is also composed of a number of different func- 
tions. Only auditory acuity will be treated here, the ability to detect 
faint sounds. Auditory acuity is itself complex: the person who can hear 
well at one tone level may be near deaf at higher or lower frequencies. 
Because of their relevance to musical aptitude, some of the other auditory 
functions will be considered in a later section. 

The older tests of auditory acuity employed sound sources like whis- 
pered speech or the ticking of a clock. In the whispered speech test, the 
examiner stands some distance from the subject and whispers a number 
of words. The subject tries to say what each word is in turn. The examiner 
walks farther and farther from the subject to determine the distance at 
which the whispered words can be heard. Although tests of this kind are
adequate for detecting gross losses of hearing, they have a number of
defects. It is difficult to standardize both the loudness and clarity of
whispered speech. One examiner will inevitably whisper a bit louder
and/or clearer than another in spite of the best efforts at standardization.
Such tests measure auditory acuity within a narrow range of the tone, or
frequency, continuum. The person who can hear whispered speech well
might not be able to hear sound at a different frequency, such as the
sound of a ticking clock. The difference in the acoustical properties of
testing rooms and the problem of ruling out extraneous noises add to the
difficulty of standardizing this type of test.

* Obtainable from C. H. Stoelting Company.


Figure 11-1. Audiogram of a child with severe high-tone deafness. (Adapted from
L. A. Watson and T. Tolan, 48.)

A number of instruments have been developed to measure auditory
acuity at different points on the frequency continuum. These are the
pure-tone audiometers (see 11, 48). One tone is presented
at a time. The standard procedure is to gradually raise the sound intensity
until the subject indicates that he can hear the tone. Then, starting with
a sound that the subject can hear well, the intensity is lowered to a point
where it can no longer be heard. (You will remember this as one of the
psychophysical methods for determining thresholds, the method of limits.)
The procedure is repeated at different frequency levels. The resulting
data can be plotted as a profile of auditory acuity (see Figure 11-1).
Pure-tone audiometers are now available for group testing. Earphones
are given to all of the subjects. Standard answer sheets are used for the
subjects to indicate whether or not they hear the tone at different
intensities. It is not possible to determine the individual's auditory acuity as
finely in the group testing situation. The individual who shows a marked
loss in any frequency range on the group test should, if possible, be given
the individual test to determine more accurately the nature of his
hearing deficiency.
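
The individual audiometer procedure described above can be sketched as follows. This is a minimal illustration and not the routine of any actual instrument: the listener is simulated by a hypothetical function that reports hearing whenever the intensity reaches his true threshold, intensities are in arbitrary units, and averaging the ascending and descending results is an assumption of the sketch.

```python
def ascending_threshold(hears, floor, ceiling, step):
    """Raise the intensity from `floor` until the tone is first reported."""
    level = floor
    while level <= ceiling:
        if hears(level):
            return level
        level += step
    return None  # never heard within the range tested


def descending_threshold(hears, floor, ceiling, step):
    """Lower the intensity from `ceiling` until the tone is no longer reported."""
    level = ceiling
    while level >= floor:
        if not hears(level):
            return level
        level -= step
    return None  # still heard at the lowest level tested


def method_of_limits(hears, floor=0, ceiling=80, step=5):
    """Average the ascending and descending results for one frequency."""
    up = ascending_threshold(hears, floor, ceiling, step)
    down = descending_threshold(hears, floor, ceiling, step)
    if up is None or down is None:
        return None
    return (up + down) / 2


def audiogram(true_thresholds, **settings):
    """Repeat the procedure at each frequency to build a profile.

    true_thresholds maps a frequency to the simulated listener's
    (hypothetical) true threshold at that frequency.
    """
    profile = {}
    for frequency, limit in true_thresholds.items():
        def hears(level, limit=limit):
            return level >= limit
        profile[frequency] = method_of_limits(hears, **settings)
    return profile


if __name__ == "__main__":
    # A made-up listener with a marked loss at the highest frequency.
    print(audiogram({250: 12, 1000: 18, 4000: 63}))
```

The returned profile plays the part of the plotted audiogram: one estimated threshold for each frequency tested.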


PRACTICAL USES FOR SENSORY TESTS 


The individual with a visual or hearing defect is at a definite disadvan- 
tage in many performance situations. Particularly, the young school child 
is handicapped. It is sometimes found that poor vision or hearing is 
responsible for apparent dullness. The child who cannot see the black- 
board or hear the teacher will learn very little regardless of his latent 
capacity for learning. Reading difficulties can often be traced to visual 
defects. In order to speak correctly the child must imitate the speech 
of others. He cannot imitate what he cannot hear. To do well on most 
psychological tests, the child must be able to hear the instructions and to 
read the test content. As is common practice in many school programs, 
children should be systematically tested at different age levels for audi- 
tory and visual ability. 

Even the sophisticated adult is often unaware that he has a visual or 
auditory defect. It often happens that individuals who wear glasses for 
the first time are surprised that the world “looks that way.” Persons with 
poor far acuity often take it for granted that everyone sees objects more 
than 50 feet away as “big blurs.” Consequently, the individual cannot be 
relied upon to detect his own sensory difficulty and seek help. He is likely 
to blame his inability to perform well on “dumbness” rather than on a 
visual or hearing loss. 

Sensory ability is necessary for many different occupations. The classic
example is the baseball umpire's dependence on "good eyesight." Color
discrimination is paramount to the interior decorator, the tailor, and the
artist. A moderate level of auditory acuity is necessary for most jobs,
particularly if it is required to talk with others or to follow spoken
instructions. However, there are few jobs in which high-level sensory ability is
the primary attribute. Even jobs that would seem to depend heavily on
sensory acuity usually require only average ability. A typical example is
that of a sonar operator, who uses a sound echoing device to detect hostile
submarines. It was found that up to a certain point auditory acuity is a
“must” for the job; but above the level of average hearing, auditory acuity
is not a predictor of good and poor sonar operators.

Sensory disabilities are often prominently involved in adjustment
problems. It is not uncommon for the person who is partially cut off from his
environment because of a sensory disability to become withdrawn,
depressed, and resentful. Sensory tests can be used to detect difficulties in
audition and vision. Correction or compensation for the sensory disability
often leads to an improvement in the individual's personal relations.


MOTOR DEXTERITY


It has long been recognized that the person who works well with his 
“head” does not necessarily work well with his hands. Accomplishments 
like shaping a fine piece of pottery, hitting a home run, and operating 
complex machinery, have little to do with intelligence or with formal 
school training. 

Among the oldest motor tests are the peg boards, designed to measure 
arm, hand, and finger dexterity. A typical example is the Stromberg Dex- 
terity Test (46; see Figure 11-2). The first part of the test requires the 


Figure 11-2. Stromberg Dexterity Test. (Courtesy of The Psychological
Corporation.)

Figure 11-3. Crawford Small Parts Dexterity Test. (Courtesy of The Psychological
Corporation.)


Complex Coordination Test (36) used for the selection of pilots by the
Air Force (see Figure 11-5). The test is a partial replica of an airplane
cockpit, complete with stick and rudder. Lights on a control panel
simulate the maneuvers of an airplane. The subject must use stick and rudder
to match the stimulus lights, the counterpart in the test of coordinating
stick and rudder as required by the situation in an airplane.

Analysis of Motor Tests. Most tests of motor dexterity are highly
dependent on speed. Consequently, they prove to be better predictors of
jobs in which speed rather than quality is important. There are many jobs
in which speed is only a minor consideration. The person who can saw a
board quickly does not necessarily have the craftsmanship of the skilled
cabinetmaker.

Motor dexterity tests have acceptably high reliabilities, usually over .80
and sometimes over .90. An important characteristic of motor tests is that
they tend to correlate very little with one another. Slightly different
manipulations of the same material often have little in common. For
example, the two parts of the Crawford Small Parts Dexterity Test
correlate on the average less than .50 (10). A correlation of only .57 was found
between the two parts of the Stromberg Dexterity Test (46). Correlations
between different motor tests prove to be even smaller. Factor-analytic
studies of motor dexterity tests have generally found few broad common
factors. Tests in this area are characterized by high specificity.

Figure 11-4. Bennett Hand-tool Dexterity Test. (Courtesy of The Psychological
Corporation.)

Because of the small overlap between motor dexterity tests, there are 
no general measures of motor ability such as the general intelligence tests 
supply for intellectual functions. Motor tests are then of relatively little 
use in vocational counseling. They are most legitimately used in industrial 
selection where the job is simple, requires a definite set of motor skills, 
and is highly dependent on speed. Among jobs of this kind are those in 
production line work, sewing machine operation, and packaging. 

Motor dexterity tests show at best only moderate predictive validity 
for most situations in which they are used (see Tables 11-1, 11-3, 11-4, 
and 11-5). However, if they are used in conjunction with other ability 
tests, they often add a small, but important, increment to the over-all 
validity of the battery. Motor tests tend to be more valid when they are 
made to resemble the actual machine or instrument which is featured on 
the job. Tests designed in this way are called job miniatures. If the job is 
that of lathe operator in a machine shop, the best motor test would 
employ a miniature lathe with the same kinds of dials, handles, and 
controls that appear on the real lathe. During the Second World War, the 
Army Air Force used a variety of motor tests for the selection of pilot 
trainees. The Complex Coordination Test, which resembles most closely 
what the pilot actually does, generally proved to be one of the most 
valid instruments. 

Figure 11-5. Complex Coordination Test. (Courtesy of the U.S. Air Force.) 


MECHANICAL APTITUDE 


Mechanical ability is popularly thought of as concerning the making 
and fixing of things as distinct from clerical, sales, administrative, and 
professional abilities. In general, we speak of mechanical ability in 
relation to trades and various levels of skilled work. There is no fine 
dividing line between mechanical occupations and those that are not 
mechanical. Some occupations that most of us would classify as mechanical are 
plumber, carpenter, automobile mechanic, and boat builder. 

There is no one type of test function which underlies mechanical work 
to the same extent that the general intelligence tests relate to school work. 
In order satisfactorily to predict a particular mechanical job, a range of 
different kinds of tests must be used in a battery. Different combinations 
of tests are usually needed for different jobs. Some of the kinds of tests 
that have proved useful in the prediction of mechanical work and in 
vocational guidance are described in the following sections. 

Intellectual Ability. Because an individual is involved in making and 
fixing things, it does not mean that intelligence is an unimportant 
attribute. When it is possible to do so, either a battery of the major intel- 
lectual factors, or, at least a general intelligence test, should be tried as a 
predictor. The spatial and perceptual factors are very useful in predicting 
many mechanical jobs. Some tests embodying these functions will be 
considered in the subsequent discussion. However, it is important also to 
consider the verbal, numerical, and reasoning factors in the prediction 
of mechanical work. Because the “verbal” intelligence tests are 
mainly composed of these factors, a good general intelligence test is often 
one of the best predictors of job success (see Tables 11-1, 11-2, 11-3, 
11-4, and 11-5). 


TABLE 11-1. CORRELATIONS OF ABILITY TEST SCORES WITH MEASURES OF JOB 
PERFORMANCE 
(Adapted from E. E. Ghiselli, 20) 

                                    Type of job 
Test         Clerical   Protective   Skilled   Semiskilled   Unskilled 
                        service      trade 
og 45 .16 
General intelligence 36 = s" 
Arithmetic a2 C 15 
Number comparison 28 ? 45 27 
Spatial relations 06 ae 45 
Mechanical principles 20 21 .05 


Finger dexterity 


General intelligence tests tend to be more predictive of how well the 
individual does in job training than how well he performs subsequently 
on the job. This is probably because the training phase requires more 
abstract ability. In many cases the training program involves 
classroom-like procedures, reading of materials, and the learning of 
machine operations. These are the kinds of things that intelligence tests 
predict best. Predictor tests usually correlate higher with performance in 
training than with later job performance because, as a rule, performance 
is more reliably measured in the former than in the latter. Progress in 
training is usually graded more carefully. There is more of an opportunity 
to observe the worker, and, in many cases, tests are used to assess 
progress in training. 


TABLE 11-2. MINIMUM MENTAL AGES FOR SEVERAL JOBS 
(From A. S. Beckman, 3) 

Mental age, 
years    Boys                                  Girls 

5        Dishwasher                            Sewer (simple patterns) 
         Vegetable parer 
6        Mixer of cement                       Mangle operator 
         Freight handler                       Crocheter (open mesh) 
7        Painter (rough work)                  Cross stitcher 
         Shoe repairer (simple tasks)          Hand-iron operator 
8        Haircutting and shaving               Scarf-loom operator 
         Gardener                              Dressmaker (not including 
                                               pattern work) 
9        Foot-power printing-press operator    Fancy-basket maker 
         Mattress and pillow maker             Cook (simpler dishes) 
10       Sign painter                          Sweater-machine operator 
         Painter (shellacking and varnishing)  Launderer 
11       Storekeeper                           Librarian's assistant 
         Greenhouse attendant                  Power sealer in cannery 


The general intelligence tests tend to be more predictive of success in 
high-skill rather than low-skill jobs. That is, validities are usually higher 
for such jobs as electrical technicians and complex machine operators 
than they are for jobs like truck drivers and furniture movers. The dif- 
ference in validity is probably due to the increased importance of abstract 
ability in more highly skilled work. In selecting people for unskilled 
work, the problem is to set up minimum standards of intelligence rather 
than to seek persons of high intelligence (see Table 11-2). 

Spatial and Perceptual Tests. A wide variety of mechanical work requires 
the spatial and perceptual factors which were described in Chapter 9. 
The automobile mechanic needs spatial orientation in his work. In 
a typical job situation he is lying under the automobile and must remove 
a nut from the engine above him. The nut is slanted at a 45-degree angle, 
and he must remove it with a wrench that has two joints. The mechanic 
must orient himself spatially to the complex of angles and movements in 
order to do such work. In draftsmanship, it is necessary to portray three- 
dimensional objects on two-dimensional pieces of paper. In some 
drawings the objects must be shown in tilted positions or partially dis- 
assembled. It takes spatial ability, either spatial orientation or visualiza- 
tion, to work as a draftsman and at many other jobs. 

Figure 11-6. Sample items from the Revised Minnesota Paper Form Board. 
For each item, the subject must choose the figure which would result if the 
pieces in the first section were assembled. (Courtesy of The Psychological 
Corporation.) 

One of the best known spatial tests for mechanical aptitude is the 
Minnesota Paper Form Board Test (33, 39). It is a predictor of 
grades in shop courses, supervisors' ratings of workmanship, production 
records, and many other measures of mechanical performance [...] 

[...] visual scene requires perceptual [...]. [...] tests can be found in 
some of the multifactor batteries which were discussed in Chapter 9. Some 
tests designed specifically for industrial selection will be described in 
the section on clerical aptitude. (See Tables 11-1, 11-3, 11-4, and 11-5 
for typical validities.) 

TABLE 11-5. AVERAGE VALIDITY COEFFICIENTS OF VARIOUS APTITUDE TESTS 
FOR PROTECTIVE OCCUPATIONS, SERVICE OCCUPATIONS, 
AND VEHICLE OPERATORS 
(From Ghiselli and Brown, 21, p. 229) 


                 Protective             Service                Vehicle 
                 occupations            occupations            operators 
Type of test     Training  Proficiency  Training  Proficiency  Training  Proficiency 
Intellectual: | | 
Intelligence ............. AG | 50 | 07 18 14 
Immediate memory ...... | 28 | ` 
Arithmetic ......00..0... 30 59 | —a1 4 04 
Number compa ]| es | 44 | 
Name comparison ........ ar | —.17 | 
Cancellation ...,,....... | — 9 
Spatial and perceptual: | | 
Location | eo d ue d ace sse | cat 18 
Spatial relations... | 33 04 42 T 21 
Speed of perception ...... 30 sm. | 08 
Mechanical principles... 41 20 36 E 
Motor: | | | 
Tapping S 353 Cs v» e even ag b. ite 888 8 | dw e 32 
Dotting | say 28 
Finger dexterity 19 | 
Hand dexterity —.09 
Arm dexterity . | =0L 
Simple reaction time . | 37 
Complex reaction. | | | 35 


Motor Dexterity. The kinds of motor tests which were discussed earlier 
in the chapter are often featured in mechanical aptitude batteries. Motor 
tests were placed in a separate section because they are involved in a 
number of aptitudes other than mechanical aptitude, athletic aptitudes, 
for example (see Tables 11-1, 11-3, 11-4, and 11-5 for typical validities). 

Mechanical Comprehension. Among the most successful tests of 
mechanical aptitude are those designed to measure the mastery of 
mechanical principles, or the ability to reason with mechanical problems. 
In a typical problem, a truck driver is rushing medical supplies to a 
fire-damaged town. He discovers that his truck is about an inch too tall to 
clear a bridge leading to the town. What should the truck driver do? The 
answer is to let some air out of the tires. As another example, a motorist 
must remove a boulder which blocks the road. He finds a long, stout 
pole to do the job. He must then decide whether to use the pole as a pry 
or a lever. He can construct a lever by balancing the pole on a rock 
placed between the boulder and himself (the lever exerts more force 
than a pry). After deciding to use the lever action with the rock as a 
balancer (a fulcrum), he must then decide where to place the balancer 
(rock) and how best to exert his strength against the pole. As a third 
example, a hoist is being built to lift tree trunks into the bed of a truck. 
A system of gears and chains is set up to transfer the power from an 
electric motor to the hoist. It must be decided how large the different 
gears should be to give the desired power to the hoist. 


Figure 11-7. Sample items from the Bennett Mechanical Comprehension Test, 
Form AA. (Courtesy of The Psychological Corporation.) The items shown ask: 
“Which room has more of an echo?” and “Which would be the better shears 
for cutting metal?” 

The three problems above are typical of those found in mechanical 
comprehension tests. Tests of this kind tend to range over several ability 
factors. Because of the paucity of factor-analytic studies of mechanical 
comprehension tests, it is not always possible to say just what a particular 
test measures. Some of them emphasize the spatial factors, which are 
prominently involved in many tests of mechanical aptitude. Other 
functions which appear in some mechanical comprehension tests are 
numerical computation, various aspects of the reasoning factors, and 
familiarity with tools and machinery. Examples of mechanical 
comprehension tests are the Bennett Mechanical Comprehension Test (5, 6) 
(see Figure 11-7), the Mechanical Reasoning Test of the DAT (discussed in 
Chapter 9), and the SRA Mechanical Aptitude Test (40). 

Mechanical Information. One of the most useful measures for the 
selection of skilled and semiskilled workers is a test of information, or 
knowledge, about tools and machinery. For example, a set of questions like 
the following would be useful in the selection of automobile mechanics: 

1. What is a torque converter? 

2. What is a ratchet? 

3. Where is the “needle valve” in an automobile? 

4. What source of power is used to run the generator in most 
automobiles? 

5. How do you recognize “pre-ignition”? 

Information tests can be constructed either to measure general 
knowledge of mechanical work or to measure knowledge of one particular job. 
It is usually the case that a test constructed specifically for one job, such 
as for television repair men, will be more predictive than a test of 
knowledge in general about mechanical work. However, the test which 
is constructed specifically for one job is likely to be useful only for 
selecting personnel for that job or for closely related jobs. Also, because 
of the specific knowledge which the instrument measures, it is usually of 
little use in vocational guidance, where it is usually necessary to measure 
broad functions rather than highly specialized information. A more general 
measure of mechanical knowledge is often useful in vocational guidance. 

Analysis of Mechanical Aptitude Tests. A sufficient personnel-selection 
program usually requires a careful study of the particular industrial 
setting. The diversity of psychological functions which is required by 
different jobs makes it necessary to try out a range of tests to find the 
ones that will work well in practice. Also, it is often necessary to invent 
and construct tests for particular jobs. 

Few of the mechanical aptitude tests have been studied as extensively 
as the tests of intellectual ability. Consequently, it is usually necessary to 
perform considerable research in the job setting to determine the utility 
of particular tests. In few cases have norms for the tests been obtained 
on a sufficiently representative sample to use them as dependable guides. 
It is generally more meaningful to obtain local norms for particular 
personnel selection or school programs. 

Mechanical aptitude tests are fairly easy to standardize and make 
reliable. Reliabilities in many cases approach those of the better general 
classification tests. Mechanical aptitude tests have modest validity for 
many different jobs, the amount varying considerably with the job. The 
seemingly small validities for some jobs should be regarded from a 
number of standpoints. First, there is the possibility that formal abilities 
such as those measured in psychological tests have little to do with job 
success. A second possibility is that the criterion of job success, the 
assessment, is unreliable and consequently cannot be predicted. If the 
assessment is determined only from the sketchy impressions of foremen 
and managers, it is seldom very reliable. It is meaningful in this instance 
to make the correction for attenuation as discussed in Chapter 6, 
correcting for the unreliability of the assessment. Because of the 
unreliability inherent in most job assessments, mechanical aptitude tests 
are often considerably more valid than is apparent from the correlation 
coefficients (this was discussed more fully in Chapter 6). The third point 
to consider is that the modest to low individual validities should not 
obscure the fact that a combination of several tests in a battery will often 
produce reasonably good predictive efficiency. 
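
The correction for the unreliability of the job assessment can be sketched 
in a few lines of Python. The sketch assumes the usual one-sided form of 
the correction for attenuation, in which the observed validity coefficient 
is divided by the square root of the reliability of the criterion. The 
function name and the figures used (.30 and .49) are hypothetical and are 
not taken from any of the studies cited. 

```python
import math


def correct_for_criterion_unreliability(observed_validity, criterion_reliability):
    """Estimate validity against a perfectly reliable criterion.

    Assumes the standard one-sided correction for attenuation:
    corrected r = observed r / sqrt(reliability of the criterion).
    """
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("criterion reliability must be in (0, 1]")
    return observed_validity / math.sqrt(criterion_reliability)


# Hypothetical figures: a test correlates .30 with foremen's ratings,
# and the ratings themselves have a reliability of only .49.
corrected = correct_for_criterion_unreliability(0.30, 0.49)
print(f"observed validity .30, corrected validity {corrected:.2f}")
```

With a criterion this unreliable, the corrected coefficient is noticeably 
larger than the observed one, which is the sense in which a test may be 
more valid than the raw correlation suggests. 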


CLERICAL AND STENOGRAPHIC APTITUDES 


Clerical Aptitude. Clerical aptitude, as it will be discussed here, 
concerns the office clerk, the individual who deals with files, ledgers, 
accounts, and correspondence. The term clerk is much more general than 
this particular usage, referring variously to grocery clerk, department 
store clerk, and even court clerk. Perceptual speed tests have a special 
importance in the prediction of clerical performance. Perceptual speed 
is involved in chores like proofreading letters, searching for particular 
accounts in a long list, and alphabetizing names. 

A typical test which emphasizes perceptual speed is the Minnesota 
Clerical Test (1). The test is divided into two separately timed parts, 
Number Comparison and Name Comparison (see Figure 11-8). Retest 
reliabilities range from .85 to .91. Extensive validation research shows 
correlations ranging up to .60 between the Minnesota Clerical Test and 
business school and job performance criteria. Other perceptual speed 
tests for clerical aptitude are the Clerical Speed and Accuracy Test of 
the DAT, the Clerical Perception Test on the GATB, and parts of the 
General Clerical Test (50). 

Figure 11-8. Sample items from the Minnesota Clerical Test. (Courtesy of 
The Psychological Corporation.) Instructions: If the two names or numbers 
of a pair are exactly alike, make a check mark on the line between them. 
Sample pair: Cargill Grain Co. ______ Cargil Grain Co. 

Perceptual speed is a necessary component of many clerical jobs, but 
other functions should be tested as well. Verbal comprehension is a 
desirable attribute, especially a knowledge of spelling and grammar. 
Clerical work often involves routine arithmetical operations, and 
therefore a numerical computation test is likely to be a useful selection 
instrument. If the clerical job requires the use of accounting machines or 
other equipment, some of the motor dexterity tests may be of use. The [...] 
tests not only perceptual speed but a variety of verbal and arithmetic 
abilities; hence, it probably would serve well in the selection of office 
clerks. The General Clerical Test (50) is a short battery designed to 
cover the major functions required in clerical work. Scores on nine 
subtests are combined to form clerical, numerical, and verbal scores. 

Stenographic Ability. The selection of stenographers is best made on 
the basis of specific job requirements. Typing and shorthand ability are 
the major requirements in most stenographic jobs. Therefore, proficiency 
tests in these skills offer a sound basis for the selection of stenographers 
(for typical tests, see 7, 8, 47). Here, as in many other testing problems, 
the needs of a selection program are not always the same as those of the 
vocational guidance situation. In vocational guidance, before the 
individual has had an opportunity to test himself in specialized training or 
on the job, some prediction must be made as to how well he will perform. 
Although an insufficient amount of research has been done on the 
aptitude for stenographic work to allow us to speak with certainty, the most 
promising attributes seem to be the motor skills involved in typing, 
measures of verbal comprehension and language usage, and interest in 
stenographic work. 


ARTISTIC APTITUDES 

The nature of art and artistic ability have been matters of interest to 
psychologists for well over a hundred years, dating back to and before 
Fechner. In spite of this long interest, the measurement of artistic ability 
lags behind the testing of other ability functions. This is due in part to 
practical considerations. There has always been a more urgent need for 
intellectual and vocational tests than for tests of artistic ability. Research 
on classifying men in the Armed Forces, testing children in school, and 
selecting men in industry has won financial support because of the 
immediate gains to be expected. Although the study of artistic ability 
offers some practical advantages, it never has promised a sufficient 
commercial market to attract large numbers of psychologists. 

Another reason that tests of artistic ability lag behind is the intrinsic 
complexity of the functions to be measured. In this area, it is very difficult 
to distinguish aptitude from achievement. The accomplished musician 
or painter can be judged by what he currently does. But it is difficult to 
find the underlying aptitudes that give one child an advantage over 
another in reaching eventual artistic accomplishment. 

Good art is largely a matter of time and place. Chinese music sounds 
cacophonous to us, and no doubt much of our music seems strange to the 
Chinese. Some primitive music centers almost entirely on the drum and 
other percussion instruments. Complex rhythmic patterns used are too 
elusive for the “civilized” ear. We would miss the esthetic appreciation 
that the primitive has for his music as much as he would be baffled by the 
symphony orchestra. The delicate sensitivities of a Japanese poem are lost 
on an Occidental audience. We might cite many other examples to show 
that art is a matter of values. Different people have different values, and 
values change over the ages. 

Different abilities are involved in the production of art and in the 
appreciation of art. The music critic may not be a musician at all; the art 
historian may have never painted. Different abilities are required in the 
production of different kinds of art work. 

The measurement of artistic aptitude evolves into several components. 
For producing works of art there are probably some underlying abilities 
that cut across different times and different cultures. In graphic art work, 
the ability to make line drawings, to combine colors, and to achieve 
properties of “balance” are required in most paintings. In musical ability 
there are the basic sensory skills of tonal memory, sense of pitch, and 
recognition of rhythms, which to some extent cut across different kinds of 
musical production. 

Another attribute which can be tested is the appreciation of art forms. 
Appreciation is dependent on the values in a particular culture and on the 
individual's knowledge and acceptance of those values. Finally, tests can 
be made of how well the individual can produce particular art forms; 
such achievement is dependent both on his initial aptitude and the 
training that he has had. 

MUSICAL APTITUDE 

Seashore Measures. One of the oldest and most widely used musical 
tests is the Seashore Measures of Musical Talents (42). The test stimuli 
are reproduced on phonograph records, which can be used for the testing 
of moderate-sized groups of subjects. The battery includes the following 
subtests: 

1. Pitch discrimination. The subject is asked whether the second of 
two tones is higher or lower than the first. The items are made 
progressively more difficult by decreasing the difference in pitch 
between the pairs of tones. 
2. Loudness discrimination. The subject judges which of two tones 
is louder. 
3. Time discrimination. One tone is presented for a longer period 
of time than another. The subject judges which of the two tones 
is longer. 
4. Rhythm judgment. The subject judges whether two rhythmic 
patterns are the same or different. 
5. Timbre judgment. The subject judges whether or not two tones 
are of the same musical quality. 
6. Tonal memory. Two series of notes are played. In the second 
series one of the notes is altered. The subject judges which of the 
notes is different. 

Scores on the subtests correlate near zero on the average with 
intelligence tests (17). The subtest scores are partly independent, with [...] 
intercorrelations ranging from .48 to .25 for different samples (17). 
Split-half reliability estimates for the subtests range from .62 to .88. 
Rhythm and timbre are the least reliable. If, as is often done, the six 
subtests are added to form one general measure, high reliability can be 
expected of the test. Except for large differences in scores, the subtests 
are not sufficiently reliable for considering differential aptitudes within 
the test. 
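
Why a sum of six moderately reliable subtests can be expected to have 
high reliability can be shown with a short computation. The sketch below 
assumes that the subtests are standardized to unit variance, that their 
errors of measurement are uncorrelated, and that a single average 
intercorrelation describes the battery. Under those assumptions the 
reliability of the sum is 1 minus the ratio of summed error variance to 
total variance. The function name is hypothetical. The values .62 and .88 
are the extremes reported above; the other four reliabilities and the 
average intercorrelation of .35 are invented for illustration. 

```python
def composite_reliability(subtest_reliabilities, mean_intercorrelation):
    """Reliability of an unweighted sum of standardized subtests.

    Assumes unit variances and uncorrelated errors, so that
    error variance of the sum = sum of (1 - r_ii), and
    total variance of the sum = k + k(k - 1) * average intercorrelation.
    """
    k = len(subtest_reliabilities)
    error_variance = sum(1.0 - r for r in subtest_reliabilities)
    total_variance = k + k * (k - 1) * mean_intercorrelation
    return 1.0 - error_variance / total_variance


# Illustrative values only (see the assumptions stated above).
reliabilities = [0.62, 0.70, 0.75, 0.80, 0.85, 0.88]
composite = composite_reliability(reliabilities, 0.35)
print(f"reliability of the six-subtest sum: {composite:.2f}")
```

Under these assumptions the composite is more reliable than any single 
subtest, while each subtest taken alone remains too unreliable for 
comparing one aptitude with another within the same person. 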

Scores on the Seashore test are affected very little by age. Similar 
results are found for grammar school, high school, and adult populations. 
Although the research results are somewhat contradictory (17, 41, 44), it 
seems that scores are affected only slightly by musical training. These two 
findings taken together suggest that the Seashore subtests measure 
basic aptitudinal functions which are possibly inherited. The larger 
question is whether the aptitudinal functions involved in the tests are of 
any importance in predicting musical accomplishment. 

An insufficient amount of research has been done with the Seashore test 
to speak with firmness about its predictive utility. Modest to small 
correlations have been found with grades in music classes and with [...] 
ratings of musical ability (12, 24, 31, 38). The test differentiates 
moderately well between students who complete specialized musical training 
and those who drop out. It is reasonable to think that at the level of 
specialized music training most of the persons with poor ability in the 
Seashore type of measures will have already been eliminated. 

A number of persons have argued that the Seashore measures are not 
very similar to the skills that are involved in the actual production of 
music. The Seashore subtests measure certain types of sensory 
discrimination which might be necessary for musical ability but not 
sufficient. Where the Seashore test would have its most important value is 
in helping parents decide whether their children would be likely to profit 
from extensive musical training. This would save considerable money and 
would keep the neighbors from having to hear little Susan grind away for 
years at an instrument she will never master. At present the Seashore test 
is difficult to administer below the age of ten, and the predictive validity 
of the test at younger ages is not known. 

Wing Test. The Wing Standardized Tests of Musical Intelligence (49) 
were designed to stay as close as possible to the skills involved in musical 
production and appreciation. Like the Seashore test, the Wing test uses 
phonograph recordings. The following seven functions are tested: 

1. Chord analysis. Judging the number of notes in a chord. 

2. Pitch change. Judging the direction of change of notes in a 
repeated chord. 

3. Memory. Judging which note is changed in a repeated melodic 
phrase. 

4. Rhythmic accent. Judging which performance of a musical phrase 
has the better rhythmic pattern. 

5. Harmony. Judging which of two harmonies is better for a 
particular melody. 

6. Intensity. Judging which of two pieces has the more appropriate 
pattern of dynamics, or emphasis. 

7. Phrasing. Judging which of two versions has the more appropriate 
phrasing. 

The first three subtests measure complex sensory abilities. The other four 
concern the esthetic value of different compositions. The subtest scores 
are added to form one general measure of musical aptitude. 

The Wing test has received favorable response from teachers of music, 
who feel that the test covers many of the skills that are important in 
musical training. Little is known about how well the test can predict 
available criteria. The author (49) reports correlations of .60 and above 
between the test and teachers’ ratings of musical ability in three [...] 
groups. It is possible that the Wing test will prove to be a better 
differentiator of musical talent at higher levels of ability than the 
Seashore. The Wing test might then be useful in the guidance and selection 
of students who want to go on from some initial musical training to more 
advanced training. 

There are a number of other tests based on phonographically recorded 
tones and musical phrases. The Drake Musical Memory Test (13, 14) 
emphasizes the memory component, which appears in only one subtest 
of the Seashore and Wing batteries. The Drake memory test [...] 
in that the items concern short musical melodies [...] 
tones. The subject must determine whether two [...] are the same, 
and if not, whether the change has been made in the [...], the timing, or 
in specific notes. The Kwalwasser-Dykema Music Tests [...] a 
battery of 10 subtests. Six of the subtests are similar to those of the 
Seashore. The subtests cover much the same ground as the [...] plus 
the ability to read musical notation and some components of musical 
appreciation. 

Analysis of Musical Aptitude Tests. Not enough research has been done 
to say how well the current tests work. A particular problem is the lack 
of adequate criteria of musical accomplishment. School grades in 
history, techniques, and general knowledge of music are the [...] 
indices. But these are not the same as artistry in musical [...]. 
Judgment of the actual mastery of musical instruments must [...] 
be based on the impressions of teachers and other persons, and 
impressions of this kind usually have only modest reliability. Even if there 
are some difficulties in validating the instruments, much more research 
should be done to determine how well they work. 

The tests which were discussed in the previous sections are all, strictly 
speaking, tests of appreciation. That is, the subject is not required 
to play an instrument but only to listen and judge what he hears. 
However, some of the complex judgments involved seem to underlie the skills 
that are needed in musical production. It is likely that other types of 
tests could be used in conjunction with the conventional measures to 
obtain a better estimate of musical ability. Motor skills are involved in 
playing most musical instruments, the piano being an outstanding 
example. Motor tests might be profitably used in the prediction of musical 
accomplishment. Although intelligence tests correlate very little with the 
available musical tests, this does not mean that they would be of no use 
in predicting musical accomplishment. It would be expected that 
intelligence and, more generally, the factors which underlie differential 
aptitude tests would be useful in the prediction of course grades in music 
curricula and in special music schools. 

It is likely that an individual’s interest in musical work will be as 
predictive of later success as tests of the ability type. Two such interest 
measures are the Farnsworth Scales (18) and the Seashore-Hevner Tests for 
Attitude toward Music (43). The small amount of research that has been 
done indicates some promise for tests of musical interest. 


GRAPHIC ART 


McAdory Test. The field of graphic art testing has been dominated by 
a particular type of item, in which a "masterpiece" is compared with one 
or more altered versions of the same work. One of the oldest tests of this 
kind is the McAdory Art Test (34), which came out in 1929. The test 
contains pictures of 72 art works covering a wide variety of contemporary 
art forms, ranging from pictures of furniture and automobiles to works of 
art in museums. Four versions of each art work are given; these differ in 
shape, arrangement, shading, and use of color. The subject is required to 
rank-order the four versions in terms of his preferences. 

Items for the McAdory test were selected in terms of the judgments of 
experts, including teachers, critics, and artists. Items were retained only 
if at least 64 per cent of the judges agreed on the ranking of the four 
versions of each picture. A primary weakness of the test is its dependence 
on contemporary art values. For example, it is likely that the preferences 
for furniture, automobile design, and even paintings have changed since 
the test was constructed. 


Figure 11-9. Illustrative items from Meier Art Judgment Test. (Reproduced
by permission of Norman C. Meier.)


Meier Test. The Meier Art Judgment Test (35) is the most widely used
test of art appreciation. It also uses the altered-version type of item.
The test differs from the McAdory in that only one altered version is
given for each original art work, and the items concern relatively
timeless art masterpieces. The items are all in black and white. The
altered version of each masterpiece is meant to destroy the esthetic
organization. In a typical altered version, one figure is moved to the
side in such a way as to disturb the balance of the painting (see
Figure 11-9).


The initial selection of items was made on the basis of expert
judgments. Items on which there was high agreement among 25 experts were
retained. The items were further pared down in terms of internal
consistency statistics. Only those items showing a high correlation with
the total score were placed in the final form.
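
The second stage of selection, keeping items that correlate highly with the total score, might look like the following sketch. It is not the Meier procedure itself: the source does not name the statistic or the cutoff, so the product-moment correlation and the cutoff `MIN_R = 0.30` are assumptions, and the response matrix is invented.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation; returns 0.0 if either list has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def item_total_correlations(responses):
    """Correlate each item (column) with the total score (row sum)."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[j] for row in responses], totals)
            for j in range(n_items)]

# Hypothetical data: six subjects (rows) by three items (1 = keyed response).
responses = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

MIN_R = 0.30  # hypothetical cutoff, not taken from the source
correlations = item_total_correlations(responses)
selected = [j for j, r in enumerate(correlations) if r >= MIN_R]
print(selected)
```

Because each item also contributes to the total, these correlations run somewhat high on short tests; correlating each item with the total of the remaining items is a common alternative.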
Split-half reliabilities for the Meier test range from .70 to .84 in
relatively homogeneous groups of subjects. Scores correlate only negligibly
with traditional measures of intelligence. Only a small amount of research
has been done to determine how well the test predicts available criteria.
It has been shown that the test differentiates art students from non-art
students and different art students in terms of the amount of training
that they have had. A correlation of .46 was found with the grades of 50
art students (27). Correlations ranging upward from .40 were found with
ratings of creative art talent (9, 37).
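
A split-half reliability of the kind quoted for the Meier test can be computed as in the sketch below. It assumes an odd-even split of the items and applies the Spearman-Brown step-up formula to estimate the reliability of the full-length test. The source does not say how the halves were formed or whether the reported figures were stepped up, and the data are invented.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(responses):
    """Return (half-test correlation, Spearman-Brown full-test estimate).

    Assumption: halves are formed from odd- and even-numbered items.
    """
    first_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson(first_half, second_half)
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full

# Hypothetical data: five subjects (rows) by four items (1 = keyed response).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

r_half, r_full = split_half_reliability(responses)
print(round(r_half, 3), round(r_full, 3))
```

The stepped-up estimate is always larger than a positive half-test correlation, which is one reason it matters which of the two figures a manual reports.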


Graves Test. To remove the items as much as possible from traditional
and contemporary art values, the Graves Design Judgment Test (22, 23)
consists entirely of abstract designs (see Figure 11-10). Each test item
consists of either two or three versions of the same basic design. The
altered versions were constructed to violate accepted esthetic principles.
The judgments of art teachers and art students were used to select the
best 90 items from an original list of 150. Split-half reliability
estimates range from .81 to .93. Although the Graves test gives promise of
being a useful measure, only a small amount of empirical work has been
done with the instrument.


Figure 11-10. Illustrative items from The Graves Design Judgment Test.
(Courtesy of The Psychological Corporation.)


Worksample Tests. A number of tests were designed to test how well
individuals can actually produce graphic art works. Typical of these is
the Horn Art Aptitude Inventory (25, 26), which includes the following
subtests (see Figure 11-11):
1. Scribble exercise. Making outline drawings of 20 simple objects
2. Doodle exercise. Making abstract compositions out of simple
geometrical forms
3. Imagery. Working from a given set of lines to a completed
composition

Other tests which are largely concerned with the production of art works
are the Knauber Art Ability Test (29) and the Lewerenz Tests in
Fundamental Abilities of Visual Art (32).


Figure 11-11. Sample item from the Imagery Test of the Horn Art Aptitude
Inventory. The subject is shown only the lines in rectangle A from which
he is to make a drawing. Examples of completed drawings are shown in B and
C. (From C. Horn, 25; reproduced by permission of the American
Psychological Association.)


The worksample tests must rely on the judgments of graders. Product
scales are used, in which a particular drawing is compared to a standard
set. Sample drawings are available for each score level. The grader gives a
score in accordance with the apparent nearness in quality of the subject's
drawings to the product scale examples. In spite of the apparent
subjectivity of the scoring system, moderately high reliabilities are
reported for tests of the worksample type (25, 26, 28, 29). Current
evidence indicates that the worksample tests predict course grades as well
as the appreciation tests and do a better job of predicting teachers'
ratings of creative ability (2, 26).

Analysis of Graphic Art Tests. The difficulties of defining and
measuring musical aptitude are magnified in the measurement of graphic art
aptitude. Since the graphic arts are more dependent on cultural values,
the standards of accomplishment are weaker; hence the underlying abilities
are more difficult to determine. Unlike the sensory discrimination tests
of musical aptitude, the current measures of graphic art aptitude depend
heavily on training. Consequently they are of less use in the guidance of
prospective art students, where all tests of art aptitude seem to have
their most promising use.

Current tests are biased toward certain cultural groups. For example,
it was found (45) that much lower scores on the McAdory test are made
by Navajo Indian children than by children in New York City, in spite
of the fact that the Navajo culture has a highly developed art of
its own. The available tests appear in most cases to be cleverly
designed, but the paucity of research which characterizes the study of
artistic abilities leaves many questions about how well the tests work in
practice. Perhaps future factor-analytic studies of graphic art tests will
lead to a better knowledge of the underlying functions and how they can
best be measured.
CHAPTER 12


Achievement Tests 


The achievement test is a natural outgrowth of the teacher's own class- 
room examination. Now achievement tests are used more broadly to assess 
levels of skill in numerous endeavors in addition to school work—in mili- 
tary, industrial, and particularly in government civil service work. Most 
of the instruments discussed in this chapter are commercial tests which 
can be purchased and used in a wide variety of training programs. 

In terms of the effort spent in test construction and the number of tests
that are used, achievement tests overshadow all other psychological
measures. They also have an immediate importance not shared by any other
type of test. Achievement tests are helpful in assigning school grades,
in planning school curricula, in remedial training, and in vocational
guidance. Achievement tests are of sufficient importance to merit the work
of hundreds of specialists, and the tests are used in such quantities as
to make the development of a new instrument a sizable commercial venture.
Numerous books (see 1, 2, 13, 17) are devoted either wholly or in part to
the development and use of achievement tests.

Aptitude and Achievement. Achievement tests are meant to measure
accomplishment, as distinct from aptitude tests and intelligence tests,
which are intended to measure the capacity for accomplishment. In some
situations it is meaningful to distinguish aptitude and accomplishment
even though it is difficult to measure them separately. Achievement tests
necessarily measure an interaction of more basic aptitudes with the
learning that takes place in specific training situations. As a group,
children who do well in school work must have at least moderately high
aptitude for such training, combined with personality traits and interests
which lead them to accomplishment. However, it is not necessarily true
that children who do poor work in school are lacking in aptitude. Some of
them may be suffering from physical difficulties such as poor vision, from
disrupted personalities, and from culturally impoverished or unwholesome
family environments. Children of this kind sometimes show marked
improvement in school progress after their physical and personal problems
have been dealt with.
Validity of Achievement Tests. The achievement test is the primary
example of an assessment as described in Chapter 4. The purpose of an
achievement test is to measure an individual's accomplishment in either a
specific unit of instruction or in a more general course of training. An
achievement test might be a good predictor of later school grades or
professional accomplishment, but this is not its primary purpose. Although
it is hoped that achievement at one stage of training will be predictive
of later success, it would be unfair to judge achievement tests on that
basis alone. If predictions of behavior were the standard for achievement
tests, it would be very difficult to decide which later behaviors to
predict and even more difficult to measure them. For example, what future
behaviors should be predicted by course grades in ancient history? The
student might master the course content quite well yet take no further
courses in history nor enter an occupation for which knowledge of ancient
history is an obvious prerequisite.

The achievement test is valid if it faithfully represents the important
behaviors in a performance situation. This is referred to by some authors
as "circular validity" or "intrinsic validity." It was said previously that
the validity of achievement tests and assessments generally is dependent
on content representativeness. Validity in this sense means that the
achievement test should concern a broad sample of the information and
problems that are thought to be important in a course of training.

Construction of Achievement Tests. The primary steps in constructing
an achievement test are those which were outlined in Chapter 8 in the
section entitled "Construction of Assessments." Preparatory to the
construction of the test, an outline should be made of the skills which
the particular training course is intended to teach. Because of the funds
and personnel that are available for constructing most large-scale
achievement tests, the outline can usually be spelled out in more detail
than is possible in the individual teacher-made course examination. The
outline should be carefully compared with the texts, drills, and lecture
content of the training course. If the achievement test is to be used
widely in different schools, the outline should be discussed with numerous
teachers. Some of the major headings in an outline of reading skills are
as follows:

1. Perception of written material. The child must first be able to see
the individual letters and words before he can read adequately. It is also
necessary for the child to have a proper balance of eye muscles and to
coordinate eye movements in reading.

2. Vocabulary. The next essential is for the child to have a knowledge
of the words that appear in written material at his level of schooling.

3. Reading speed. Although, within limits, reading speed is not as
important as the understanding of what is read, the very slow reader is
handicapped in his school work.

4. Retention. Before any further understanding is possible, the child
must be able to remember some of the factual content of what he reads.

5. Critical reaction. At the highest level of reading accomplishment, the
child should be able to synthesize what he reads with past knowledge,
analyze the implications of what is read, and react critically to the
material.

6. Auxiliary skills. These include a familiarity with tables of contents,
the meaning of footnotes and references, acquaintance with library filing
systems, and a familiarity with dictionaries and reference sources.

Each major heading in the outline should be filled out in much more
detail than the example above (see 5, chap. 7; 15, pp. 272-273 for more
complete outlines of reading skills).

After the outline is completed, the next step is to compose test items
for each part of the outline. Considerable material has been written on
the composition of achievement test items (see 2; 15, chap. 4; 17). Some
of the principal points to consider were outlined in Chapter 8. After a
large collection of items (an item pool) has been made, a number of
judges should be used to estimate the clarity, relevance, importance, and
representativeness of the items for the particular achievement area. In
Chapter 8, we described some methods of item analysis which are useful
in selecting those items from an item pool which will produce a better
achievement test.

The Content of Achievement Tests. Because nearly all standard
achievement tests are "objective" examinations which employ some variant
of the multiple choice type of item, they have been accused of measuring
only the memory for simple factual details. Most teachers would agree that
the memory for details is not the important thing to be gained from a unit
of instruction. Here in this book, for example, the reader will soon
forget the names of many of the tests, the specific formulas, and the
particular statistical results which are cited. But it is hoped that the
content of the book will be such as to provide the reader with a point of
view, a logical scheme, and a way of thinking about human measurement
which will not fade with time.

The content of multiple choice tests was discussed in Chapter 8, the
viewpoint expressed there being that objective tests often do focus on
simple factual details but that this need not be the case. There is
considerable empirical evidence and practical experience to show that
carefully constructed multiple choice tests can adequately measure the
understanding of complex principles, the ability to draw conclusions, and
the ability to make inferences from one set of facts to another. The
standardized achievement tests often embody more important content than
the teacher's own multiple choice examinations. This is because there are
more funds, facilities, and expert services available for the construction
of standardized achievement tests. Because the standardized achievement
tests successfully use multiple choice items, teachers should not
assume that all examinations should be of that kind. The poorly
constructed multiple choice test is often less effective than an essay
examination. Specific examples will be shown later in the chapter of how
achievement tests can adequately measure the high-level understanding
of different subject matters.

Achievement Test Scores. The most popular scoring procedure is the
straight multiple choice form in which the question, or stem, is presented
with anywhere from two to more than a dozen alternatives, one of which
is designated the "correct" response. A number of variants of the multiple
choice form are useful in special circumstances. One of these is the
matching type of item, in which each of a number of names, for example, is
matched with one of a number of accomplishments or events. A typical
item is as follows:


Columbus ____      a. invented the compass
Balboa ____        b. sailed around the world
Magellan ____      c. discovered Australia
                   d. discovered America
                   e. invented the steamship
                   f. discovered the Pacific Ocean


The matching question usually has more names or terms than the three
in the example above. An important rule is that there should be
considerably more events or definitions than persons or terms. Otherwise,
"guessing" plays an important part in the responses. The matching question
is most useful when there are many names, definitions, and specific facts
to be tested. In such cases, it provides a handy way of condensing a
number of separate questions into one large item.

Another variant of the multiple choice type of item is to require the
subject to rank the alternatives from most to least correct. In many items
the "correct" alternative is not completely correct and the "incorrect"
alternatives are not equally incorrect. It is reasonable then to think that
more information could be obtained from each question by having all
the alternatives ranked. Although there is something to say for this
procedure, it offers a number of drawbacks. Principally, it is difficult
to construct alternatives for an item in such a manner that they can be
logically ordered from most to least correct. In many questions, the
alternatives are either true or false, and there is little basis for
choosing among the alternatives. Also, items of this kind take much more
time to administer and, consequently, the range of items that can be used
is severely restricted if, say, only an hour is available for the test. In
addition, such items are difficult and time consuming to score.
Another variant of the multiple choice item is to have the subject mark
all of the alternatives which he is sure are incorrect. Like the
rank-ordering of alternatives, this should logically provide more
information about the subject's knowledge than the straight multiple
choice item. The subject's score consists of the number of incorrect
alternatives marked as incorrect. The major problem with this type of item
is that it introduces personality factors to an unknown extent. Some
people are more willing to take chances than others and consequently to
gamble on more of the incorrect alternatives. Such predispositions will
influence test scores, making them partly measures of personality
characteristics, which should logically not enter into achievement tests.
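
The scoring rule for this item type, and the way willingness to gamble enters the score, can be shown with a short sketch. It counts only the incorrect alternatives that the subject marks, as the text describes. The source does not say what happens when the correct alternative is marked by mistake, so no penalty is applied here (an assumption), and the two subjects are hypothetical.

```python
def elimination_score(marked, correct):
    """Number of incorrect alternatives that the subject marked as incorrect.

    Assumption: marking the correct alternative earns nothing and costs
    nothing, since the source does not describe a penalty.
    """
    return sum(1 for alternative in marked if alternative != correct)

# One item with alternatives a-e; "c" is the keyed correct response.
correct = "c"

# Hypothetical subjects with the same knowledge: each is certain only
# that "a" is wrong. The cautious subject marks just that alternative;
# the bold subject also gambles on "b" and "d" and happens to be right.
cautious_marks = {"a"}
bold_marks = {"a", "b", "d"}

cautious = elimination_score(cautious_marks, correct)
bold = elimination_score(bold_marks, correct)
print(cautious, bold)
```

The two scores differ although the subjects know the same amount, which is the personality effect the paragraph warns about.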

Although it is important to continue exploring new types of test items,
the weight of the argument is in favor of the straight multiple choice
form for most purposes. In most cases, it allows the presentation of more
items in a shorter time, is the easiest type of item for subjects to
understand, and is easier to construct and score. It is usually found that
more complex scoring systems give results which correlate very highly with
the straight multiple choice form.

Achievement Test Norms. Because the purpose of most achievement
tests is to compare either a pupil or a unit of instruction with more
general educational standards, it is important to have truly
representative norms. The norms available for many of the standard
achievement tests are generally superior to those used with other kinds of
tests. In many ways, the task of obtaining norms is simpler with
achievement tests than with predictor tests. The normative population to
be used with predictor tests is sometimes difficult to define. The
normative populations for achievement tests are more easily specified,
e.g., all of the sixth-grade students in the United States. Also, because
the normative population is contained in specific training programs,
subjects are more readily available for testing, and procedures of
sampling can be more effectively applied.

The question often arises as to whether national norms or local
norms should be employed with achievement tests. The answer depends on how
the tests are used. If, for example, achievement test scores are being used
to select students for advanced classes in a particular city, it would be
wiser to use local norms. If the purpose is to compare the amounts and
kinds of training in different schools across the country, national norms
should be used.

Classifications of Achievement Tests. There are a number of different
ways of classifying achievement tests in terms of their uses. One
distinction is made between tests designed to measure one subject matter
only, such as the knowledge of grammar, and tests designed to measure a
broad range of subject matters, which might include a grammar subtest
along with subtests for literature, history, mathematics, and other
subjects. They will be referred to respectively as unisubject and
multisubject achievement tests. The distinction is a matter of degree
because some of the unisubject tests, particularly those for language and
mathematical skills, divide the subject matter into a number of separately
tested parts.
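
The practical difference between local and national norms, raised earlier in this section, can be seen by converting one raw score to a percentile rank against two norm groups. The sketch below uses a common convention (all scores below plus half of the scores tied with it); the source does not specify a convention, and both norm samples are invented and far smaller than real norm groups.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(score, norm_scores):
    """Percentile rank of `score` within a norm group.

    Assumption: ties count half, one common convention among several.
    """
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    tied = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * tied) / len(ordered)

# Hypothetical norm samples for the same sixth-grade test.
national_norms = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
local_norms = [60, 65, 70, 75, 80, 85, 90, 92, 95, 98]  # a high-scoring city

raw_score = 75
national_rank = percentile_rank(raw_score, national_norms)
local_rank = percentile_rank(raw_score, local_norms)
print(national_rank, local_rank)
```

With these invented figures the same pupil stands well above the national middle but below the local middle, which is why the choice of norms should follow from the decision being made.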

A distinction is often made between survey achievement tests and diag- 
nostic tests. The survey tests are meant to measure how much a person 
knows. The diagnostic tests are meant to tell why a person is or is not 
achieving adequately. Survey tests are the typical kind used to measure
the mastery of different subject matters. Diagnostic tests are used pri-
marily in guidance and remedial work to offer clues as to why children
are doing poorly in certain subject matters. The distinction between sur- 
vey and diagnostic tests is also a matter of degree. There is an unfortunate 
tendency for people to refer to all unisubject tests that measure a number 
of components as diagnostic tests. Examples of more truly diagnostic tests
will be discussed in the coming sections. 


USES FOR ACHIEVEMENT TESTS 


Assignment of Grades. Teacher-made tests are used more often than
standardized achievement tests for the assignment of course grades in
elementary school, high school, and college. The advantage of the
teacher-made test is that it is tailored to the particular unit of
instruction. It is also the teacher's responsibility to decide what grades
should be given. The use of achievement tests for the assignment of grades
would in many cases be unfair because of the differences in content of
courses bearing the same name. In order to keep a semblance of order in
the recording of school progress, it is essential that courses with
similar names have at least a core of knowledge in common. For example, if
an individual receives a passing grade in plane geometry, it is expected
that he has a familiarity with certain facts and concepts. However, it is
reasonable to expect the instructor to emphasize certain topics more than
others and to include material that another instructor might not discuss.
Therefore, the standardized achievement test used for the assignment of
school grades might penalize particular students to some extent.

Standardized achievement tests are more often used to assign course
grades in special school programs, in government, military, and industrial
training. For example, achievement tests are frequently used in assigning
grades to participants in technical training schools in the Armed Forces.
It is sensible to use achievement tests to assign grades in special school
programs for a number of reasons. The course content is usually more
standardized than that in, say, most colleges. For example, in a course
on the "maintenance of small arms" there are rather definite rules to be
taught. It is expected that all instructors will teach very much the same
information, and it is not uncommon for all instructors to use a standard
teaching syllabus to insure uniformity of instruction. The course content
of many special training programs remains relatively fixed over moderate
periods of time. Therefore, a standard achievement test may be usable for
the assignment of course grades over a period of several years or more
with only minor revisions in test content. In many special training
programs, the instructor is not sufficiently trained in pedagogic
techniques to assign course grades effectively. Although the instructor
may be a very good salesman, infantry sergeant, or machinist and also may
be a good teacher, he may have little knowledge of testing procedures and
the grading of students. Consequently, it is more meaningful to assign
grades on the basis of achievement test scores.
Classification of Individuals. Achievement tests are used more
frequently to test the individual's mastery of a whole range of subject
matter than to grade his performance in an individual unit of instruction.
Multisubject achievement tests are often administered to graduating
grammar school students preparatory to entering high school. Although the
student may have good marks in individual courses, it is still uncertain
what his total educational experience has been. He may have forgotten
large amounts of the subject matter, or the particular school may have
underemphasized certain subject matters which are necessary in high
school. Students who do poorly on parts of the achievement test, in
grammar, for example, can then be classified into a group needing special
preparatory instruction during the freshman year of high school.

Achievement tests are often used to classify students into special courses
or curricula within grammar school, high school, or college. There is a
growing trend toward tailoring instruction to the individual. A widely
used procedure is to assign students to one of several sections on the
basis of over-all progress shown either by previous course grades or by
achievement tests. An even more logical method of classification, if
resources permit, is to gear the instruction not only to the student's
over-all level of achievement but to his own strong and weak points as
well. One of the multisubject achievement tests is useful for that
purpose.

Counseling and Remedial Training. Achievement tests are useful to the
school psychologist and the clinical psychologist in understanding the
school difficulties of particular children. The practice in some schools of
promoting children from one grade to the next even if they do very poorly
often permits the child with little aptitude for school work to get far
over his head in difficult subject matters. This will frustrate him as
well as lower the morale of teachers and students. Achievement tests can
be used to indicate the generally low level of such a student's school
progress, and this in turn may suggest placement of the student in a
training program more suited to his ability.




Diagnostic achievement tests are used primarily to help in the
counseling and remedial training of students who have difficulty in
mastering certain school topics, and particularly to help understand
the specific reasons for the difficulties. If it is found, for example,
that the child understands what he reads but that he reads very slowly,
special remedial training can be undertaken to increase reading speed.
The cause of truancy or poor conduct in the classroom is often indicated
by achievement tests. Such disruptive behavior is sometimes related to a
low level of achievement. The student takes out his frustration on the
teacher and his classmates. A very high score on an achievement test may
also be indicative of a student's difficulties in conduct. The very bright
student is often bored by the relatively low level of the curriculum and
turns to misbehavior or truancy.
Vocational Guidance. Achievement tests in combination with course
grades are useful in helping the student to choose future programs of
school work or vocational training. They are particularly useful in helping
students to decide whether or not to go on from high school to college,
and if so, which kind of college training to undertake. If, for example,
the student is considering premedical training in college, he should show
achievement in biology, science, and have a generally high level of
accomplishment. There are exceptions, of course: the student who does
poorly in high school but goes on to high accomplishment in college. But
the achievement test scores at least indicate the amount of improvement
the student will need to show in order to reach minimum standards for
subsequent schooling.

Two students with the same over-all level of accomplishment at the end
of high school may have different strong and weak points in particular
subjects. The measurement of such differential achievement with
multisubject tests provides a basis for deciding about careers and future
training.

Achievement tests used in vocational guidance must be evaluated as
predictors, not as assessments. Test scores are used essentially to forecast
how well an individual is likely to perform later. Using achievement tests
in vocational guidance is relatively safe when the question concerns going
on from high school to college or from college to graduate work. For
example, the individual who performs poorly in the language arts in
college is not likely to meet with success in the graduate school
specialties of journalism and English. It is less safe to use achievement
test scores to advise students to go into vocations such as sales work,
carpentry, and forestry. If the predictive validity of the achievement
test is not known for the vocations in question, the vocational guidance
is as good or as bad as the counselor's judgment.
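
For reference, the predictive validity spoken of here is ordinarily expressed as a correlation between test scores and a later criterion. A minimal statement, assuming the product-moment correlation is the index used (the passage itself does not name one), with x_i the achievement test score of individual i, y_i the later criterion measure (for example, college grades), and N the number of individuals followed up:

```latex
r_{xy} \;=\;
\frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
     {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^{2}}\;
      \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^{2}}}
```

When no such coefficient has been obtained for a given vocation, there is no empirical check on the forecast, which is the situation described in the last sentence above.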
Measuring the Effectiveness of Instructional Techniques. The primary
reason that many teachers are unfriendly to standardized achievement
tests is that they are viewed as tests of teaching ability. For example, if it
is found that the students in a particular school do poorly on the biology
section of an achievement test in comparison to the scores made by
students in neighboring schools, it suggests that the biology teacher is
not doing a good job. There are three major variables contributing to the
grades that students make on an achievement test: the initial aptitude of
the students, the effectiveness of the training program, and the nature of
the test. These three factors should be carefully considered before blame
or praise is placed on the teacher alone.

The teacher cannot bring students to high achievement if they are
lacking in initial aptitude. Large differences are found in the
achievement test scores of children in different schools, varying with
geographic region, ethnic background, and socioeconomic status.

It is unfair to use achievement test results to measure the effectiveness
of teachers if the test itself is poorly constructed. If the test is poorly
standardized or if the content is generally trivial, it is not the teacher's
fault that students do poorly.

If the achievement test is well constructed and if account is taken of
the over-all level of achievement for a group, the teacher cannot entirely
escape the blame for a low level of achievement in a particular subject
matter. The low level of achievement may be due in part to crowded
classrooms, lack of proper equipment, and to a poorly run school, but it
cannot be denied that the teacher's individual effectiveness is also
prominently involved.

The practice of using achievement tests to measure the effectiveness of
instruction is disturbing to some teachers. They argue that the
achievement test puts a "strait jacket" on instructional methods, forcing
the teacher to tailor the instruction to the kinds of material in the
achievement test. This may discourage the teacher from experimenting with
new methods of instruction and with new aspects of the subject matter. It
is not unheard of for a teacher to coach students on materials so similar
to those in the achievement test that the effectiveness of the measuring
instrument is ruined.

Aside from the use of achievement tests to measure the effectiveness of
teachers, achievement tests have a real purpose in measuring the
effectiveness of different kinds of instruction. For example, the question
might arise as to whether or not a film series would help in the
instruction of geography. If students who have the film series perform
better on an achievement test than those who participated in a straight
lecture course, this offers one type of evidence in favor of the films.
Achievement tests are
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helpful in determining the effectiveness of many such variants of teaching 
p v tests are useful in the organization of school 5 
Decisions have to be made about the level at which students can e 
particular subject matters and the best sequence for Jemen — 
courses. It might be desirable to offer a course in French earlier vim 
ing than had previously been the practice. If the students who s fom 
course at an earlier age do about as well on an achievement test or = 
much worse than the students who take the course later, this suggests tha 
the course could be given e 

Selection of Individuals. 
tests is to assess curre 


arlier in training. : " 
Although the primary purpose of achievemen 


nt performance, they are also useful as predictors of 
future behavior, Achievement tests can be used to select colle 


office workers, insurance salesmen, and many others. When achievement 
tests are used in this way, they stand on new logical grounds. They are 
then predictors and are as good or as bad as their correlations with meas- 
ures of future school or vocational success, The achievement test that is 
a well-constructed instrument for the assessment of performance in a 


training program may be no good at all as a predictor of future success in 
particular vocations, 
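
The point that a predictor is only as good as its correlation with later
success can be illustrated with a small sketch. The scores below are
invented for illustration, and the helper name pearson_r is hypothetical;
the statistic itself is the ordinary product-moment correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two paired lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: achievement test scores at the end of training and
# ratings of job success a year later for the same six persons.
test_scores = [52, 61, 47, 70, 58, 66]
job_ratings = [3.1, 3.8, 2.9, 4.2, 3.0, 4.0]

validity = pearson_r(test_scores, job_ratings)
print(f"predictive validity (r) = {validity:.2f}")
```

The same test could show a much lower coefficient against a different
criterion, which is why a test must be validated separately for each use as
a predictor.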


Achievement tests are best for predicting success in vocations which entail
activities similar to those in the training programs for which the tests
were originally designed. For example, an achievement test of mathematical
skills developed for high school students is more likely to be a good
predictor of success in college mathematics courses than of success in
politics. Because most achievement tests concern school subject matter, they
are most useful for predicting success in future educational efforts.
Achievement test scores of graduating high school students are generally as
effective as, if not more effective than, intelligence tests for predicting
college grades. When the two kinds of tests are combined with high school
grades, good predictive efficiency is usually obtained.
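
Why combining predictors improves prediction can be sketched with the
standard formula for the multiple correlation of a criterion with two
predictors. This is only an illustration for the two-predictor case: the
three correlations passed in are invented values, not figures from the text,
and multiple_r is a hypothetical helper name.

```python
from math import sqrt


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    if r_squared < 0 or r_squared > 1 + 1e-12:
        raise ValueError("the three correlations are mutually inconsistent")
    return sqrt(min(r_squared, 1.0))


# Hypothetical values: achievement test vs. college grades, intelligence test
# vs. college grades, and the correlation between the two tests.
combined = multiple_r(0.55, 0.50, 0.60)
print(f"multiple correlation = {combined:.2f}")
```

The gain over the better single predictor shrinks as the predictors
correlate more highly with each other, since they then carry largely the
same information.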

Achievement tests are used prominently in the selection of persons for
specific jobs and professions, particularly in the civil service program. A
typical predictor battery is that for stenographers, in which the tests
consist of achievement in typing, shorthand, and other skills pertinent to
office work. Achievement tests have been used in this way to select
individuals for a wide variety of jobs.


SURVEY ACHIEVEMENT TESTS

Standard achievement tests are available for many different subject
matters, and there are usually competing forms from which to choose. The
following sections will consider several examples in each major area of
achievement testing. The person who is interested in acquiring achievement
tests for particular purposes should consult the reference sources cited in
Chapter 18 and should obtain catalogues from the major test publishers
listed in Appendix 1.

The multisubject survey tests are generally favored over the separate 
unisubject tests. That is, rather than choose separately constructed tests 
for geography, history, biology, and the other subject matters, it is gen- 
erally better to use a multisubject battery in which the subtests have all 
been constructed and standardized alike. Some ambiguity in comparing 
students on different unisubject tests often comes from the different test 
construction methods used by different commercial firms. An even greater 
advantage of the multisubject battery is that the norms are usually 
obtained for all tests on the same subjects.

There are several exceptions to the preference for multisubject batteries
over unisubject tests. Several subject matters are sufficiently complicated
within themselves to require extensive individual treatment. This is so of
mathematics and reading. Each of these is a unisubject domain in name only.
There are various functions that need to be tested in each. In order to test
the extensive skills within each of these general subject-matter areas, it
is necessary to use such a variety of test materials as to make it
impractical in most cases to administer them along with tests for geography,
biology, and other subject matters. Another instance in which the unisubject
test is preferred over the multisubject batteries is in the testing of
special skills such as handwriting and musical accomplishment. The testing
of these skills usually requires novel testing methods and scoring
procedures which are sufficiently different from the typical multiple choice
test to require separate treatment.
Survey Reading Tests. Reading is by far the most important skill to be
tested in school children. The first several grades are devoted largely to
teaching the child how to read. From that point on, school work is almost
synonymous with the comprehension of written material. If a child is having
difficulty in reading and understanding what is read, it is important to
find it out as soon as possible so that remedial instruction can be
undertaken. If a school had its choice of only several tests that could be
used in a testing program, it would be wise to use, along with one of the
better intelligence tests, a comprehensive test of reading skills.

The distinction between diagnostic and survey tests is particularly tenuous
in the measurement of reading skills. Almost all of the reading tests
measure a number of related functions and, as such, offer some help in
diagnosing reading difficulties. The reading tests cited in the coming
section entitled diagnostic tests are placed there rather than here because
they are more specifically designed to offer clues about underlying reasons
for reading difficulties.

A typical survey test of reading is the Gates Basic Reading Tests (3).
Separate sections are used to measure ability in: (a) reading to appreciate
general significance; (b) reading to predict outcomes of given events;
(c) reading to understand precise directions; and (d) reading to note
details. Illustrative items are shown in Figure 12-1. Age and grade norms
are available for grades 3½ through 8. All of the parts are timed, the
amount of time varying with the grade up to 40 minutes in all. Subtest
reliabilities range from .80 to .96 for different grades. Although separate
scores can be obtained from the four subtests, it is better to average them
as an over-all measure of reading ability. The correlations among subtests
are too high to permit the use of profile difference scores. Except for
large differences in subtest scores, score differences would be largely due
to chance.

FIGURE 12-1. Sample items from the Gates Basic Reading Test, grades 3½
through 8, 1957 edition. (Reproduced by permission of the Bureau of
Publications, Teachers College, Columbia University.)

Type GS. Reading to Appreciate General Significance

Smart as horses are, they do not always know what is good for them. They
sometimes want to gallop at top speed, but a good rider will never let them
do it. A horse running at his top speed is out of control, just as a
high-powered car would be. Unless a horse has been trained as a race horse,
top-speed running puts great strain on his delicate legs.

Draw a line under the word that tells what kind of running is bad for a
horse.

slow    top-speed    easy    controlled    moderate

Type ND. Reading to Note Details

Next morning she awoke and found herself in a beautiful room. The walls were
covered with silken curtains. There were two mirrors made of pure silver.
The bed was made of ivory. The coverings were made of silk and velvet. By
her bed were a dress and a pair of slippers. The dress was made of silk. The
slippers were covered with diamonds.

Where did the girl find herself?
barn    room    garden    store

What were the mirrors made of?
silver    gold    pearl    silk

What were on the slippers?
rubies    pearls    opals    diamonds

Two other widely used survey reading tests are the Nelson-Denny Reading
Test: Vocabulary and Paragraph (12); and the Cooperative English Test (19).
Many of the multisubject batteries also include reading subtests.


Survey Batteries for Elementary School. The multisubject batteries find
their widest use in the testing of elementary school children. Achievement
tests have a special importance there because so much of the child's
education lies ahead of him, and it is important to learn his strong and
weak points as early as possible. Because of the greater uniformity of
subject matter in elementary school, it is easier to compose a comprehensive
battery of achievement tests for this level than for higher levels of
education.

FIGURE 12-2. Sample items from Metropolitan Achievement Tests, Advanced
Battery. (Reproduced by permission of World Book Company.)

Vocabulary

She replies means she: 1. complains 2. depends 3. fills 4. answers

Arithmetic fundamentals

What per cent of 24 is 9?

Arithmetic problems

Ned bought one-half dozen roses for $1.68. At that price, what did one rose
cost?

English: Part II Punctuation and Capitalization

Put in the capital letters and commas, periods, and other punctuation marks
that have been left out.

Does Carl want the candy or the fruit he prefers candy but he likes fruit too.

Literature

Friday was a faithful servant of
1. Tarzan 2. Huckleberry Finn 3. Robinson Crusoe 4. Robin Hood

Social Studies: History and Civics

Western Europe became interested in exploration because
1. the feudal system disappeared.
2. many schools were opened.
3. of a desire for new trade routes.
4. the Church favored it.

Science

In the lungs, the blood gains a supply of
1. argon 2. nitrogen 3. carbon dioxide 4. oxygen


A typical series of achievement tests for the elementary level is the
Metropolitan Achievement Tests (9). The series includes four separate
batteries, ranging from the Primary I Battery for the first grade to the
Advanced Battery for grades 7 and 8. The Primary I Battery includes three
reading tests and one numerical test. Each of the three advanced batteries
includes reading, vocabulary, arithmetic fundamentals, arithmetic problems,
geography, English, literature, science, and spelling. The tests tend to be
power tests (see Figure 12-2 for illustrative items). Each battery can be
given in from one to four hours. Either four or five equivalent forms are
available for each battery. Split-half reliabilities for the subtests range
from .80 to .97. Two other widely used multisubject batteries are the
Stanford Achievement Tests (10) and the California Achievement Tests (16).

High School Achievement Tests. Comprehensive batteries for high school are
more difficult to construct than are those for elementary school. Whereas
all students tend to study many of the same topics in elementary school,
they often diverge widely in their elective topics in high school. Also, the
curriculum spreads out to include topics such as mental hygiene and civics,
which are usually not taught in elementary school. Consequently, the
available achievement tests for high school tend to capitalize on the core
subject matters that underlie high school training, particularly literature,
mathematics, general science, and some aspects of social science. The scores
that a student makes on high school achievement batteries should be
considered in the light of the special course of study which he has
undertaken and the grades made in particular units of instruction.

A typical high school achievement battery is the Cooperative General
Achievement Tests (20). The three subtests concern broad knowledge of the
social sciences, mathematics, and natural science. Each subtest is divided
into two parts, the first part being concerned with a knowledge of terms and
concepts and the second part with comprehension and interpretation (see
Figure 12-3 for illustrative items). Other high school achievement batteries
are the Iowa Tests of Educational Development (11), The Essential High
School Content Battery (8), and the California Achievement Tests (16).

FIGURE 12-3. Sample items from the Cooperative General Achievement Tests.
(Reproduced by permission of the Educational Testing Service.)

Social Studies: Terms and Concepts

Most children in the United States attend schools that are supported
primarily by
1. contributions from churches.
2. the federal government.
3. private endowments.
4. local and state taxation.
5. tuition paid by pupils.

A reciprocal trade tariff is one in which
1. both political parties compromise.
2. the rates are low.
3. exclusion of foreign goods is sought.
4. all sections of the nation benefit nearly equally.
5. two nations modify their rates on each other's goods.

A typical daily newspaper derives the largest part of its income from
1. subscriptions and newspaper sales.
2. advertising by commercial establishments.
3. payment for space by news-gathering agencies.
4. want ads.
5. donations by political parties, labor unions, etc.

FIGURE 12-4. Sample items* from the Watson-Glaser Critical Thinking
Appraisal. (Reproduced by permission of World Book Company.)

TEST 2 (answer choices: ASSUMPTION MADE / NOT MADE)

EXAMPLE. STATEMENT: "We need to save time in getting there, so we'd better
go by plane."

PROPOSED ASSUMPTIONS:

1. Going by plane will take less time than going by some other means of
   transportation. (It is assumed in the statement that greater speed of a
   plane over other means of transportation will enable the group to get to
   their destination in less time.)
2. It is possible to make plane connections to our destination. (This is
   necessarily assumed in the statement, since, in order to save time by
   plane, it must be possible to go by plane.)
3. Travel by plane is more convenient than travel by train. (This assumption
   is not made in the statement; the statement has to do with saving time,
   and says nothing about convenience or about any other specific mode of
   travel.)

TEST 3 (answer choices: CONCLUSION FOLLOWS / DOES NOT FOLLOW)

EXAMPLE. Some holidays are rainy. All rainy days are boring. Therefore:

1. No clear days are boring. (The conclusion does not follow, as you cannot
   tell from these statements whether or not clear days are boring and some
   may be.)
2. Some holidays are boring. (The conclusion necessarily follows from the
   statements, since, according to them, the rainy holidays must be boring.)
3. Some holidays are not boring. (The conclusion does not follow from the
   statements even though you may know that some holidays are very
   pleasant.)

TEST 5 (answer choices: ARGUMENT STRONG / WEAK)

EXAMPLE. Should all young men go to college?

1. Yes; college provides an opportunity for them to learn school songs and
   cheers. (This would be a silly reason for spending years of one's life in
   college.)
2. No; a large per cent of young men do not have enough ability or interest
   to derive any benefit from college training. (If this is true, as the
   directions require us to assume, it is a weighty argument against all
   young men going to college.)
3. No; excessive studying permanently warps an individual's personality.
   (This argument, although of great general importance when accepted as
   true, is not directly related to the question, because attendance at
   college does not necessarily require excessive studying.)

* The explanations in parentheses do not appear in the items of the test
proper.


Achievement Tests for Higher Education. Considerable ingenuity in test
construction is required to measure the products of college training,
graduate study, and professional work. Students can take such different sets
of courses that there is little overlap in training. Therefore, achievement
tests for higher education are best aimed at particular fields of
specialization, such as engineering. It is more hazardous to construct
survey batteries that are meant to compare all students in their general
college progress. A systematic battery for the diverse courses that students
take would be a very large-scale instrument. Instead of having one subtest
for "science" it would be necessary to have separate tests for physics,
chemistry, and biology. Other related subjects which are lumped together at
lower levels of training should be tested separately at the higher education
level.

One of the most widely used achievement test batteries for the college level
is the Graduate Record Examination (Educational Testing Service). The
battery includes tests of verbal and mathematical ability and subject-matter
tests for physics, chemistry, biology, social studies, and the fine arts.
The separate tests are sufficiently long and well standardized to provide
reliable scores. The battery is usually administered at the end of college
training. It has proved useful in guidance of college students, selection of
graduate students, and research on college curricula.

In addition to their use for the measurement of progress in training
programs, achievement tests can be used to test some of the broader aims of
higher education. Instruments of this kind are usually referred to as tests
of critical thinking. A typical example is the Watson-Glaser Critical
Thinking Appraisal (18). The five parts of the test are concerned with the
ability to recognize assumptions, make inferences, deduce consequences,
interpret statements, and evaluate arguments (see Figure 12-4 for
illustrative items). All of the materials are presented as multiple choice
items. The test illustrates the measurement of complex thought processes
with objective items.


DIAGNOSTIC ACHIEVEMENT TESTS

As was said previously, whether or not an achievement test is "diagnostic"
is a matter of degree. Some achievement tests make particular efforts to
measure the constituent skills involved in a subject-matter area, and these
more rightly earn the name of diagnostic tests.

Diagnostic Reading Tests. Diagnostic tests have assumed their most important
place in the measurement of reading skills. This is because reading is a
very important part of school achievement and because the attributes which
underlie reading skill are complex. A typical diagnostic reading measure is
the Iowa Silent Reading Tests (6, 7). The series includes an elementary
battery for grades 4 to 8 and an advanced battery for high school and
college. The advanced battery includes the following subtests:

Test 1. Rate of reading and comprehension of prose
Test 2. Directed reading of prose to answer particular factual questions
Test 3. Poetry comprehension
Test 4. Vocabulary in different content areas
Test 5. Sentence meaning
Test 6. Paragraph comprehension
Test 7. Location of information using an index

Another approach to the measurement of reading skills is through the use of
standard oral passages. The child is asked to read aloud a series of graded
passages. The Standardized Oral Reading Paragraphs is a form which has been
widely used (4). While the pupil reads each passage, the examiner uses a
standard code to indicate the number and kinds of mistakes which are made.
The examiner looks for mistakes like the following:

1. Mispronounced words
2. Mispronounced vowels
3. Repetitions: reading the same word or phrase over before going on
4. Omissions: leaving out a word or phrase
5. Substitutions: saying a different word from that in the passage
6. Insertions: adding words
7. Reversals, such as saying "no" when the word is "on"

In addition to the noting of various kinds of mistakes, the examiner has an
opportunity to observe the child while reading. This often provides some
clues about reading difficulties, such as stuttering, extreme shyness, and
visual or auditory defects.

Diagnostic Tests of Mathematical Skills. Mathematics is second to reading as
an area in which diagnostic tests are most needed. A widely used battery is
the Compass Diagnostic Tests in Arithmetic (14), comprising a series of
group tests suitable for grades 2 through 8. Addition, subtraction,
multiplication, and division are broken into a number of underlying skills,
and each is tested separately. The test might show, for example, that a
particular child habitually misplaces the decimal point in long division,
has difficulty in carrying figures from one column of addition to another,
and fails to line up properly the steps in multiplication.

Appraisal of Diagnostic Achievement Tests. There is undoubtedly a great need
for diagnostic tests to help understand difficulties in school achievement.
In order to have an adequate diagnostic test it is essential, among other
things, to obtain reliable score profiles: the measurement of differential
achievement is the object of the diagnostic battery. Unfortunately many of
the diagnostic batteries employ subtests which have low reliabilities and
high correlations with one another. Consequently, score differences within
the batteries are unreliable. In order to have the broad coverage of skills
that is necessary for diagnostic purposes and in order to make each subtest
long enough to be individually reliable, a thorough diagnostic test must
necessarily be a long, time-consuming measure to apply.
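
Why low subtest reliabilities combined with high intercorrelations make
score differences unreliable can be shown with the standard formula for the
reliability of a difference between two scores. The sketch assumes that the
two subtests are expressed on scales with equal variances, such as standard
scores; the numerical values are invented, and difference_reliability is a
hypothetical helper name.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference X - Y between two subtest scores.

    r_xx, r_yy: reliabilities of the two subtests.
    r_xy: correlation between the two subtests.
    Assumes the two subtests have equal variances (e.g., standard scores).
    """
    if r_xy >= 1:
        raise ValueError("subtests must not be perfectly correlated")
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Two subtests that each look respectable on their own (reliability .85)
# but correlate .75 with each other.
overlapping = difference_reliability(0.85, 0.85, 0.75)

# The same reliabilities with a much lower intercorrelation of .30.
distinct = difference_reliability(0.85, 0.85, 0.30)

print(f"difference reliability, subtests correlate .75: {overlapping:.2f}")
print(f"difference reliability, subtests correlate .30: {distinct:.2f}")
```

Under these assumptions the difference score is far less reliable than
either subtest when the subtests overlap heavily, which is the situation the
paragraph above describes for many diagnostic batteries.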




REFERENCES

1. Adkins, D., et al. Construction and analysis of achievement tests.
   Washington: GPO, 1947.
2. Bean, K. L. Construction of educational and personnel tests. New York:
   McGraw-Hill, 1953.
3. Gates, A. I. Gates Basic Reading Tests: Manual of Directions. New York:
   Teachers College, Bur. Publ., 1943.
4. Gray, W. S. Standardized Oral Reading Paragraphs: Manual. Bloomington,
   Ill.: Public School, 1915.
5. Greene, E. B. Measurements of human behavior. (Rev. ed.) New York:
   Odyssey, 1952.
6. Greene, H. A., et al. Iowa Silent Reading Tests, New Edition (Rev.)
   Advanced Test: Manual of Directions. Yonkers, N.Y.: World, 1943.
7. Greene, H. A., and Kelley, V. H. Iowa Silent Reading Tests, New Edition
   (Rev.) Elementary Test: Manual of Directions. Yonkers, N.Y.: World, 1943.
8. Harry, D. P., and Durost, W. N. Essential High School Content Battery:
   Manual of Directions. Yonkers, N.Y.: World, 1951.
9. Hildreth, G., et al. Metropolitan Achievement Tests: Manual for
   Interpreting. Yonkers, N.Y.: World, 1948.
10. Kelley, T. L., et al. Stanford Achievement Test: Manuals for Primary,
    Elementary, Intermediate, and Advanced Batteries. Yonkers, N.Y.: World,
    1953.
11. Lindquist, E. F. Iowa Tests of Educational Development: General Manual.
    Chicago: Science Research, 1948.
12. Nelson, M. J., and Denny, E. C. Nelson-Denny Reading Test: Vocabulary
    and Paragraph. Boston: Houghton Mifflin, 1938.
13. Ross, C. C., and Stanley, J. C. Measurement in today's schools. (3rd
    ed.) Englewood Cliffs, N.J.: Prentice-Hall, 1954.
14. Ruch, G. M., et al. Compass Diagnostic Tests in Arithmetic: Manual.
    Chicago: Scott, Foresman, 1925.
15. Thorndike, R. L., and Hagen, E. Measurement and evaluation in psychology
    and education. New York: Wiley, 1955.
16. Tiegs, E. W., et al. California Achievement Tests: Manual of Directions
    for Primary, Elementary, Intermediate, and Advanced Batteries. Los
    Angeles: Calif. Test Bureau, 1951.
17. Travers, R. M. How to make achievement tests. New York: Odyssey, 1950.
18. Watson, G., and Glaser, E. Watson-Glaser Critical Thinking Appraisal:
    Manual. Yonkers, N.Y.: World, 1952.
19. Cooperative English Test: Manual. Princeton, N.J.: Coop. Test Div.,
    Educ. Testing Serv., 1953.
20. Cooperative General Achievement Tests: Manual of Directions. Princeton,
    N.J.: Coop. Test Div., Educ. Testing Serv., 1954.


CHAPTER 13

Opinions, Attitudes, and Interests

Democracies stir and change in reaction to what the people think and feel.
Politicians avidly inspect the current opinion poll results to learn the
public sentiment toward arms reduction, farm subsidies, and other policies.
Commercial concerns spend millions to learn the kinds of household goods
which are desired by the average housewife. Teen-age fans make and break
entertainers in rapid succession. Prejudice of one group toward another
presents a barrier to the democratic ideal and a hundred years of rancor and
protest result. What the individual wants to do in life determines in good
measure the vocation he enters. As a partial consequence, the nation finds
itself with a shortage of teachers, nurses, and engineers and a surplus of
farmers, lawyers, and actors. In these and many other ways, opinions,
attitudes, and interests are important factors in everyday living.

Opinions, attitudes, and interests are prominent influences in the
individual's social adjustment. If the husband likes Beethoven and the wife
prefers Sinatra, if one is a Republican and the other a Democrat, if one is
prejudiced against Negroes and the other is not, disagreement and hostility
are hard to avoid. The individual's family and friends have their own
attitudes and opinions, and the choice between conformity and conflict is
difficult to make. Personalities can disrupt when interests do not match
abilities or when strong attitudes conflict with one another.

The four previous chapters were primarily concerned with ability, with how
much a person knows or how well he can solve problems. This and the
following chapters will be primarily concerned with personal reactions, or
what are variously called social traits, personal attributes, and habitual
performance. These human characteristics are often included in the more
general term "personality," but it is necessary to distinguish attitudes,
opinions, and interests from the kinds of personality attributes which will
be discussed in later chapters.

The devices which are used to study attitudes, opinions, and interests are
often referred to as "tests," but they are not tests in the same sense as a
vocabulary test or a mechanical comprehension test. In an aptitude test the
subject cannot control his score, except to cheat or to make a low score on
purpose. The most socially approved behavior is to make the highest score
possible, and the individual's ability limits the score which is made. This
is not at all the case with most measures of attitudes, opinions, and
interests. The measurement devices are more properly referred to as
questionnaires or inventories. They are standard procedures for obtaining
information from the individual about himself. Most of them are dependent on
the individual's knowledge of himself and his willingness to tell what he
knows.

The individual knows many things about himself which would be difficult to
learn in any way other than asking him directly. However, the individual is
not always an accurate observer of his own behavior nor a willing reporter.
This is particularly the case when the questions concern potentially
embarrassing issues such as sexual adequacy, honesty, and personal courage.
Depth psychology has taught us that the individual often hides unpleasant
things from himself through the mechanisms of projection, repression, and
rationalization. Even if the individual is aware of personal deficiencies,
he is likely to hide this information from others. Emotionally laden issues
predominate in the personality inventories and cause the subject to "fake"
responses rather than give accurate answers. Inventories for opinions,
attitudes, and interests are sometimes affected by the individual's need to
give socially acceptable responses but not nearly so much so as are the
personality inventories. (The problem of "faking" will be discussed at
greater length in the next chapter.)

Relationships Among Opinions, Attitudes, and Interests. There are no sharp
dividing lines among the three types of instruments to be discussed in this
chapter. They are all inventories dependent on the individual's reporting of
what he thinks and feels. However, it is necessary to distinguish them as
well as possible, because they are constructed and used somewhat
differently.

It is easiest to begin by distinguishing interests from opinions and
attitudes. Interest inventories are meant to measure what the individual
does and does not like to do. A typical interest item is, "Do you like to
work out of doors?" to which the subject answers either "yes" or "no."
Although interest inventories can be constructed for activities of all
kinds, they are primarily concerned with activities that distinguish jobs
and professions from one another.

Opinions and attitudes concern how the individual reacts to the people,
institutions, and ideas in the world about him. The term "opinions" is used
more often to refer to judgments and knowledge, whereas the term "attitudes"
is more connotative of feelings and preferences. Opinions are usually more
verifiable than attitudes. That is, the opinion "There are more Protestants
than Catholics in the United States" is open to direct study and
verification. Using the term "opinion" in the stricter sense to mean
verifiable statements, it is possible to determine whether or not an
individual's opinion is true or false. An example of an attitudinal
statement is, "I dislike capital punishment." This is a statement of
feeling, and it does not make sense to inquire whether it is true or false.

Although it is possible to distinguish opinions and attitudes in extreme
examples, the two are usually mixed to some extent in questionnaire studies.
For example, the statement "Negroes have lower intelligence than white
people" is open to study and as such is an opinion, but it also relates to
how an individual feels about Negroes and is consequently an attitudinal
statement as well. Our feelings about most things color our judgments to
some extent. Because of this mixture of reactions, the term "opinions" is
often used to refer to both feelings and judgments, or the term "attitudes"
is often used to cover the two.

Aside from the psychological functions that are involved in studies of
opinions and attitudes, the terms are usually employed to refer to different
kinds of studies. The term "opinions" has become associated with the opinion
poll, which involves the sampling of responses from a large and broad cross
section of the population. The questions are usually quite simple, such as,
"Will you vote for Jones or for Smith?" and in most cases require only a
"yes" or "no" answer. When several questions are employed, they are either
unconnected or not connected in a logical manner so that they add together
to form a more general measure. [...] answers the question [...] interviewed
on the street while they wait [...] while warming the b[...]

Attitude-measurement devices are more like laboratory instruments. They
often involve complex r[...] is usually necessary to [...] Attitude scales
are seldom [...] are more often employed to study particular groups. Whereas
the separate items in an opinion poll do not necessarily add together to
form a more general measure, this is the essence of the attitude-measurement
techniques. The purpose in the study of attitudes is to locate the
individual on a continuum, or scale, which ranges from very unfavorable to
very favorable in respect to the object being rated.


OPINION POLLS 


Opinion polling, like most other things, appears quite simple to the
individual who makes his living doing something else. On the surface it
seems that all that is necessary is to ask some questions of a large number
of persons. In fact, the adequate measurement of opinions is an exacting
business; and unless the polling is undertaken with considerable care,
the results may be very misleading. Considering the number of ways in
which opinion polls can go astray, it is surprising that the polling results
have been as accurate as they have during the last thirty years.

The opinion poll is an outgrowth of the “straw vote,” whose purpose 
is to sample voting sentiment before an election. One of the earliest 
recorded “straw votes” was that of the Harrisburg Pennsylvanian in 1824. 
Reporters went out to inquire whether the citizens in the surrounding 
region would vote for Henry Clay, Andrew Jackson, John Quincy Adams, 
or William H. Crawford for President. The straw vote grew steadily as 
an aspect of journalism and gradually extended to topics other than voting
behavior. By the early 1930s a number of magazines, particularly
Literary Digest and Woman's Home Companion, were involved in ex- 
tensive polling activities. 

A now famous mishap focused public interest on opinion polls and 
pointed out the need for more systematic methods. The Literary Digest
made a very bad estimate of the voting sentiment in the 1936 presidential
election (see 1, chap. 10). The magazine predicted a Landon victory
with 370 electoral votes, but Roosevelt won 523 of a possible 531 votes.
This calamitous mistake drove the magazine out of business. 

A look at how the Digest poll operated will point up some lessons about 
opinion polling. The poll was conducted entirely by mail, with 10 million
ballots being sent out for the 1936 election, of which less than 2½ million
were returned. The voters who reply in this way are usually a select
group, overweighted with persons from the upper-income and
upper-educational levels. People who reply to mail questionnaires usually
have a more direct interest in the election than do those who fail to reply.
In this case it was the upper-income individuals who were protesting against
Roosevelt.

Worse than the dependence on mail-in ballots, the Digest used very
poor sampling procedures. The mailing list was obtained from telephone
directories and automobile registrations. The people in 1936 who had
telephones and automobiles were, as a group, much higher on the
socioeconomic scale than those who did not. At that time the lower
socioeconomic group voted heavily Democratic and the higher socioeconomic
group voted heavily Republican. The Digest predicted that Roosevelt
would get only 40.9 per cent of the popular vote, whereas he actually got
60.2 per cent.

The Digest fiasco taught pollers a lesson that they will never
forget. A look at the Digest polling results for the years before
1936 would have shown the unrepresentativeness of the results. In the
election predictions for the years of 1916, 1920, 1924, and 1928, the polls
showed average plurality errors of 20, 21, 12, and 12 per cent respectively
(1, p. 180).

The first systematic polling organization in this country began in 1935,
when George Gallup formed the American Institute of Public Opinion.
Knowing the nature of the Digest sample, he was able to predict
closely the amount by which their results would be in error before the
ballots were even mailed out. Gallup's organization and others have grown
in strength since that time. Other leading polling agencies are the Fortune
Survey (Elmo Roper), the Crossley Poll, the National Opinion Research
Center, and the Office of Public Opinion Research (Princeton University).

During the last twenty years the polling agencies have grown in size
and influence and have branched out into many areas besides the
prediction of elections. Polling activities increased sharply during the
Second World War as a means of keeping a constant ear on public
opinion. This provided the government with a considerable amount of
information about the sentiment for joining the war, reaction to price
controls, faith in our allies, and many other pertinent issues.


TABLE 13-1. PREDICTIONS OF VOTING PERCENTAGES FOR WINNING CANDIDATE
COMPARED WITH THE NATIONAL VOTE
(Adapted from Albig, 1, p. 212)

Poll                        Election
                            1936*        1940         1944         1948      1952†
                            (Roosevelt)  (Roosevelt)  (Roosevelt)  (Truman)  (Eisenhower)

Literary Digest             40.9
Fortune (Roper)             61.7   53.6   415   57.0
American Institute of
  Public Opinion (Gallup)
Crossley Poll
National vote

* Percentages of total vote including that for minor candidates.
† Percentages of two-party vote only.
‡ Percentages of two-party vote with proportional division of “undecided”
  votes.



Market research is the major activity of polling agencies today.
Commercial firms spend millions of dollars to learn what the public thinks
about new products and what type of person is most likely to buy an
article. Special polls such as Hooper, Nielsen, and Trendex report the
audience reaction to radio and TV programs, and the results have a
tremendous influence on the pattern of entertainment in America.
Opinion polling has also become a new type of journalism. Polls are
conducted solely to stir the public interest, using questions like,
“Should the wife or the husband manage the family finances?” and “Would
you vote for a woman President?” Poll results now appear regularly in
hundreds of American newspapers. The election predictions have remained
the showcase of opinion polling, and the predictions have increased in
accuracy (see Table 13-1 for some election predictions).

Population Sampling for Opinion Polls. The most accurate way to 
determine public opinion is to ask questions of every individual in the 
population. A complete census of this kind is extremely laborious and
expensive. The 1940 United States government census cost $50,000,000, 
and more recent census studies were even more expensive. Consequently, 
it is necessary to deal with samples of the population as a way of estimat- 
ing how the total public reacts to an issue. 

The results of an opinion poll based on a sample of the population are
useful only to the extent that they closely resemble the results that would
be obtained from the whole population in question. This will be the case
when the sampling is precise and accurate, in the senses in which the
terms will be used here. A sampling procedure is accurate if it is
unbiased; i.e., if the sample is not overbalanced with persons of one kind
or another. If 20 per cent of the individuals in the nation hold an opinion,
the results of an accurate sampling procedure will, as the number of
individuals polled grows larger and larger, approach the 20 per cent
figure. An example of a biased, or inaccurate, sampling procedure was
given on the previous pages in respect to the 1936 election predictions
by the Literary Digest. If they had gathered larger and larger numbers
of persons from telephone directories and automobile registrations, they
would still have ended up with the wrong prediction.

The precision of a sample is determined directly by the number of
persons sampled. If the sampling procedure is accurate, the true
population results will be approximated more and more closely as the number
of individuals in the sample is increased. Statistical formulas can be used
to determine the likelihood that the population results will differ from
the sample results by particular amounts (the formulas are discussed in
Appendix 6). For example, if in an unbiased (accurate) sample of 1,000
persons, 55 per cent say that they would vote for candidate A, the odds
are less than 1 in 1,000 that fewer than 50 per cent of the entire
population would, if polled, say that they would vote for A. If we could
estimate things in our daily life, such as whether or not it will rain
tomorrow, with odds of only 1 in 1,000 of being wrong, we would feel very
comfortable in making decisions.
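
The 1-in-1,000 figure in the example can be checked with a short
calculation. The sketch below is not taken from Appendix 6; it assumes the
usual normal approximation to the sampling distribution of a proportion,
and the function name is hypothetical.

```python
import math

def prob_population_below(sample_share, n, threshold):
    """Approximate chance that the population share lies below `threshold`,
    given an unbiased sample of size n (normal approximation, an assumption)."""
    # Standard error of a sample proportion
    se = math.sqrt(sample_share * (1 - sample_share) / n)
    # Distance from the sample share to the threshold, in standard errors
    z = (threshold - sample_share) / se
    # Standard normal cumulative probability at z
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Sample of 1,000 persons, 55 per cent for candidate A, threshold 50 per cent
p = prob_population_below(0.55, 1000, 0.50)
print(f"{p:.5f}")
```

Under these assumptions the probability comes out a little under 0.001,
which agrees with the odds quoted in the text.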


The cost of polling tends to go up arithmetically with the number
sampled. That is, if it costs $2,000 to poll 400 people, it will cost about
$4,000 to poll 800 people. The precision of the sample increases only in
proportion to the square root of the number of persons polled. Thus it
takes a sample of 1,600 persons to double the precision of a 400-person
sample. There is a point of diminishing returns, where the increase in
precision becomes unimportant in comparison to the increased expense.
This is why most large-scale polls use samples that generally range
between 3,000 and 5,000 persons. With sample sizes this large, the sampling
imprecision alone is not likely to make the results different by more than
a percentage point or two from the actual population values.
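
The contrast between arithmetic cost and square-root precision can be
tabulated directly. This is a minimal sketch: the $5 cost per person is
implied by the $2,000 for 400 people in the text, while the use of the
standard error of a 50 per cent share as the measure of imprecision is an
assumption made here for illustration.

```python
import math

def poll_cost(n, cost_per_person=5.0):
    """Cost grows in direct proportion to the number polled."""
    return n * cost_per_person

def standard_error(n, share=0.5):
    """Sampling imprecision of a proportion; shrinks with the square root of n."""
    return math.sqrt(share * (1 - share) / n)

# Sample size, cost in dollars, standard error in percentage points
for n in (400, 800, 1600, 3000, 5000):
    print(n, poll_cost(n), round(100 * standard_error(n), 2))
```

Quadrupling the sample from 400 to 1,600 quadruples the cost but only
halves the standard error, which is the diminishing return described above.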

The erroneous results sometimes obtained in opinion polls are seldom
due to lack of precision (sample size). Literary Digest in the election of
1936 received over 2 million straw votes. Precision is a necessary starting
point for an adequate sample, and except for a few studies which were
conducted on less than 100 persons, most polling results have high
precision. But even if the sample is highly precise, the sample may be
biased and provide misleading results.

Sampling Techniques. Sampling procedures were discussed briefly in
Chapter 3 in regard to the gathering of testing norms. Here the subject
will be treated in more detail because of the direct importance of
sampling techniques for opinion polls (see 20 for a thorough discussion of
sampling methods). The population to be sampled in opinion polls may
be, as the term is popularly intended, the entire adult United States
citizenry, or it may be limited to the population of a certain state or
locality. Sometimes the population being sampled consists of all the
members of a particular group, such as all of the members of a particular
labor union, or all of the Catholics in the United States.

The most obvious and in many ways the ideal method of sampling is
to draw a number of persons completely at random from the population.
If, for example, a sample of 1,000 persons from a labor union is being
studied, the individuals could be randomly chosen from union lists.
(Tables of random numbers which can be found in many statistics books
are useful for this purpose.) Random sampling is a completely accurate,
or unbiased, procedure.

If there are obvious subgroups in the population such as plumbers,
carpenters, and machinists, there is a way of improving the random
sample. This can be done by ensuring that the subgroups appear in the
sample in proportion to their numbers in the population. That is, if 20
per cent of the union members are plumbers, it is ensured that
20 per cent of the sample is composed of plumbers. Each of the subgroups
(plumbers, carpenters, and so on) is referred to as a stratum. Ideally the
proper number of persons should be chosen randomly within the
stratum. That is, in the example above, the specified number of plumbers
could be randomly drawn from lists of plumbers only. A sample obtained
in this way is called a stratified-random sample.
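
The two procedures can be sketched in a few lines. The union roll, the
trade proportions, and the function names below are hypothetical; a
pseudo-random number generator stands in for the printed tables of random
numbers mentioned above.

```python
import random

def simple_random_sample(members, k, rng):
    """Draw k members completely at random, without replacement."""
    return rng.sample(members, k)

def stratified_random_sample(members, stratum_of, k, rng):
    """Draw randomly within each stratum, in proportion to its size."""
    strata = {}
    for m in members:
        strata.setdefault(stratum_of(m), []).append(m)
    sample = []
    for group in strata.values():
        # Rounding can make the total differ slightly from k in general
        quota = round(k * len(group) / len(members))
        sample.extend(rng.sample(group, quota))
    return sample

# Hypothetical union roll: 20% plumbers, 50% carpenters, 30% machinists
roll = ([("plumber", i) for i in range(200)]
        + [("carpenter", i) for i in range(500)]
        + [("machinist", i) for i in range(300)])
rng = random.Random(0)
sample = stratified_random_sample(roll, lambda m: m[0], 100, rng)
plumbers = sum(1 for trade, _ in sample if trade == "plumber")
print(len(sample), plumbers)
```

In the simple random sample the share of plumbers would vary around 20 per
cent from draw to draw; in the stratified version it is fixed at 20 per
cent by construction.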

The random and the stratified-random are the most accurate types of
sampling available for general use. However, they are too expensive to
undertake in many situations, particularly in national polling. A
satisfactory approximation can be obtained by an area sample. In this case,
dwellings are sampled instead of persons. A city can be laid out into a
large number of square areas. A number of the areas can be drawn at
random, and a number of addresses can be randomly chosen within each
area. Opinion pollers can then contact as many designated persons as
can be found and will answer the questions. The area sample is ideal for
measuring the opinions of the individuals in one city. If the sample is to
represent a regional or national population, there is the additional chore
of choosing representative towns, cities, and rural areas to study.
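
The two random draws in an area sample (first areas, then addresses
within each chosen area) can be sketched as follows. The grid size, the
number of dwellings, and the function name are hypothetical choices made
only for illustration.

```python
import random

def area_sample(city, n_areas, n_addresses, rng):
    """city maps an area id to its list of addresses.
    Stage 1: draw areas at random. Stage 2: draw addresses within each."""
    chosen_areas = rng.sample(sorted(city), n_areas)
    return {area: rng.sample(city[area], n_addresses) for area in chosen_areas}

# Hypothetical city: a 10 x 10 grid of square areas, 50 dwellings in each
city = {(r, c): [f"area {r}-{c}, dwelling {d}" for d in range(50)]
        for r in range(10) for c in range(10)}
picked = area_sample(city, n_areas=8, n_addresses=5, rng=random.Random(1))
print(sum(len(addresses) for addresses in picked.values()))
```

Only the eight chosen areas need complete address lists, which is the
practical saving over listing every person in the city.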

Even the area sample is too difficult to obtain in many studies,
particularly in the weekly polls on current issues. It is necessary to
resort instead to what is called a quota sample. The quota sample makes no
effort to select people randomly. Instead, persons are deliberately chosen
because of their individual characteristics. The effort in the quota sample
is to construct a miniature of the entire population in terms of known
demographic characteristics. The characteristics which are most often
controlled are geographic region, race, urban-rural, age, sex, and income.
The individual interviewer, located in, say, Memphis, Tennessee, is told
the number of people that he should poll in each of the above categories.
He must, for example, ensure that about 10 per cent of his sample is
Negro and the remainder white, that 30 per cent of the subjects are from
rural districts, that half are above forty years of age, and so on for the
other demographic characteristics. Little effort is made to balance out
the cross categorizations of persons, so as to ensure that half of the urban
subjects are women, and so on for more complex combinations. Some of
the categorizations depend on the interviewer's judgment to an extent.
Even though the distributions of demographic characteristics for the
individual interviewer may only roughly match the percentages of such
persons in the United States population, these inconsistencies tend to
average out as the results are combined for many different interviewers.

Spot-check Samples. In many studies it is desirable to get an indication
of national opinion without going to the time and expense of a large-scale
national poll. Sometimes used for this purpose are small local
samples, which are selected in such a way as to share approximately the
same demographic characteristics as the total population except for the
... Results from such small local samples are compared with national
polls. For example, in one study the local sample consisted
of 264 persons all of whom were obtained within 40 miles of Princeton
University. The interviewers were instructed to choose persons in the
upper-, middle-, and lower-income brackets in the ratios 1:4:5. No
systematic effort was made to balance other demographic characteristics.
The same questions that were asked of the local sample were incorporated
in a large-scale national poll. Very similar results were obtained
from the local sample and the national poll (see Table 13-2 for
illustrative results from one of the questions). However, as Cantril warns,
the representativeness of results from small local samples depends
on the questions which are asked. If the questions are related to
local issues and feelings, the results are likely to be quite misleading.
Such would be the case if opinions about segregation are sampled in
Mississippi only or if opinions about international affairs are sampled
from persons in Wisconsin only.

TABLE 13-2. COMPARISON OF THE RESULTS OBTAINED FROM AN OPINION
QUESTION ADMINISTERED TO A SMALL LOCAL SAMPLE AND TO A LARGE
NATIONAL SAMPLE
(Adapted from Cantril, 4, p. 158)

Question: If President Roosevelt made a radio talk explaining that it
would be necessary to ration gasoline to reduce everybody's driving by as
much as one-third from what is being driven now in order to save rubber,
would you be willing to see this done?

                     Small New Jersey sample,   National poll,
                     per cent                   per cent
Yes                  88                         87
No                    8                          8
Qualified answer      4                          1
No opinion            0                          4

Opinion Panels. One of the larger expenses in opinion polling is the
gathering of a new sample of persons for each questionnaire which is
used. A practice which has developed in recent years is that of forming
a panel, a sample of persons who are periodically interviewed
or mailed questionnaires. There are some distinct practical
advantages to using a panel and also some likely risks. Once the panel
is formed, the gathering of opinions is much less expensive than if new
subjects have to be obtained for each study. Because the panel will be
used over and over, it is not unduly expensive to obtain a more
representative (accurate) sample than is generally possible otherwise.
Although it is difficult to handle studies by mail the first time that
individuals participate, the members of a panel soon learn enough about
the use of questionnaires so that studies can be handled entirely by mail.
There is a great saving in the use of this method over the use of many
personal interviewers.

The principal danger in using a panel is that the continued experience
with questionnaires is likely to change people's viewpoints or at least
change what they say about their viewpoints. This is more likely to
happen when the majority of the questions are on closely related topics,
and particularly when the questions concern the knowledge of issues.
For example, if a series of studies is made on popular opinions about
health, the panel members are likely to think more about health problems
than formerly, read material relating to health issues, and be more aware
of health information in the mass media of communications.
Consequently, the panel members will tend to become more knowledgeable
than the average person and perhaps also will change their feelings about
many of the issues. Although the panel may have originally been a
representative sample of persons, the group can become unrepresentative
through the very act of participating in a series of studies.

Not a great deal is known at the present time about how panels change
over time. The practical advantages of using a panel merit more research
on this problem. It is safer to use a panel when successive questionnaires
concern relatively unrelated topics. Also, it is good to retire periodically
some of the panel members and sample new persons to replace them. In
this way there is a gradual turnover in panel membership and,
consequently, less likelihood of building up an unrepresentative sample of
respondents.


PRINCIPLES OF OPINION POLLING 


Even if the requirements of an accurate and precise sample are met, 
there are still many ways in which opinion polls can go wrong. Careful 
attention must be paid to the wording of questions, the way in which
opinions are obtained, and the biases of the interviewer.

Interviewer Bias. The interviewer is supposedly an impartial recorder 
of opinions who should have no influence on the responses which are 
obtained. A series of studies (see 4, chap. 8) shows that this is far from 
the case. It has been found that interviewers who hold a particular
opinion tend to report more respondents who hold the same
opinion. In one study the following question was asked of
interviewers (4, p. 108):

    Which of these two things do you think is more important for the
    United States to try to do:
        To keep out of war ourselves, or
        To help England win, even at the risk of getting into war
        ourselves?

After the interviewers’ own opinions were obtained, the interviewers
asked the same question of a national sample. Comparisons were
made of the results obtained by interviewers favoring keeping out of the
war and favoring helping England. The results are shown in Table 13-3.

TABLE 13-3. COMPARISON OF INTERVIEWERS’ AND RESPONDENTS’ OPINIONS ON
THE SAME QUESTION
(Adapted from Cantril, 4, p. 109)

Opinion of          Opinions of respondents as      Per cent
interviewer         reported by interviewer

Helping England     Favored helping England         60
                    Favored keeping out             40
Keeping out         Favored helping England         44
                    Favored keeping out             56


There is a striking difference in the responses obtained by the two groups
of interviewers, showing that interviewers themselves affect the responses
obtained from the public. A breakdown of the differences in terms of size
of city showed that the largest bias occurred in small towns and in rural
areas. No difference between the two groups of interviewers occurred in
cities with over 100,000 persons. Because interviewers in small towns
often know many of the inhabitants, it is likely that they tend to pick
respondents who hold opinions similar to their own. Another possibility
is that the townspeople know the interviewer and his general opinions
and that they give similar opinions in order to please the interviewer.
Interviewers with different opinions on unions, political candidates, and
other topics tend to get different responses in polling.

Whatever the subtle factors working in the relationship between the
interviewer's opinions and the responses which he obtains, this form of
bias should be controlled as much as possible. Knowing that bias of this
kind occurs more strongly in small towns, it is wise to use an outsider to
sample opinions rather than a local interviewer. Also, interviewers should
be trained to discount their own opinions in the sampling and interviewing
of subjects in so far as that is possible. One interesting control is
that used by Gallup in the polling of election sentiment. He employs half
Republican and half Democratic interviewers to help balance out the
effect of the interviewers’ own opinions.

The type of person the interviewer is can bias responses as much as
the opinions which he holds. This occurs most prominently when the
interviewer is of a different race or ethnic group than that of the
respondent. Cantril (4, pp. 114-115) reports a study in which Negro and
white interviewers asked the same questions of Negro respondents. The
results are shown in Table 13-4. The results show that strikingly different
responses were obtained by the two kinds of interviewers. In this case it is
likely that the Negro interviewers were obtaining more candid
responses. It would be expected that large differences in responses
like those in Table 13-4 would occur only on questions relating
to ethnic relations.

TABLE 13-4. RESPONSES OBTAINED FROM NEGRO RESPONDENTS BY WHITE
AND NEGRO INTERVIEWERS
(Adapted from Cantril, 4, p. 116)

Question                        Negro interviewers        White interviewers
                                reported, per cent        reported, per cent

Would Negroes be treated        Better             9      Better             2
better or worse here if         Same              32      Same              20
the Japanese conquered          Worse             25      Worse             45
the U.S.A.?                     No opinion        34      No opinion        33

Would Negroes be treated        Better             3      Better             1
better or worse here if         Same              22      Same              15
the Nazis conquered the         Worse             45      Worse             60
U.S.A.?                         No opinion        30      No opinion        24

Do you think it is more         Beat Axis         39      Beat Axis         62
important to concentrate        Make democracy            Make democracy
on beating the Axis, or           work            35        work            26
to make democracy work          No opinion        26      No opinion        12
better here at home?



Interview Bias. In addition to the bias which is caused by the
interviewer, the interview situation itself can distort responses. The
interview is a situation where at least the opinion poller is present to
hear the respondent's opinions. When interviews are conducted on the street,
there may be one or several friends of the respondent who hear the
responses. People tend to express more socially acceptable opinions when
they are in the presence of other individuals than when the opinion
questionnaire is filled out in privacy. Cantril (4, chap. 3) reports a
comparison of the results obtained from a secret response with those from an
interview. One sample of subjects was asked the questions in the usual
interview manner. A matched sample filled out the questionnaire by
themselves in their own homes. Results from one question are shown in
Table 13-5. The more socially acceptable response at the time was to
favor helping England as much as possible, and consequently, the majority
of the respondents gave that opinion in the interview situation. When
respondents were allowed the privacy of their own homes, they were
more inclined to express a less socially acceptable opinion. Interview
bias is more likely to occur when there is strong social pressure about
an issue. In the example above, the government, most of the mass
media of communications, as well as many organizations were backing
aid for England.


TABLE 13-5. COMPARISON OF RESULTS OBTAINED FROM AN INTERVIEW AND
FROM A SECRET RESPONSE TO AN OPINION QUESTION
(Adapted from Cantril, 4, p. 79)

Question: Do you think that the English will try to get us to do most of
the fighting for them in this war, or do you think they will do their fair
share of the fighting?

                   Will do their fair       Will try to get us      Don't know,
                   share of the fighting,   to do their fighting,   per cent
                   per cent                 per cent

Interview          57                       25                      18
Secret response    42                       42                      16
Difference         15                       17                       2



No-opinion Respondents. One of the difficulties that besets the opinion
poller is that of interpreting the results from respondents who refuse to
answer questions, have no opinion, or who say they have not made up
their minds. When a sizable number of the respondents, say, over ... per
cent, express no opinion, it makes the over-all results somewhat
questionable. The customary procedure is to divide the no-opinion
respondents proportionally to the responses of those persons who do
express an opinion. For example, if in an election poll, 45 per cent of the
respondents say they will vote for candidate A, 40 per cent say they will
vote for candidate B, and the remaining 15 per cent say that they have not
made up their minds, the 15 per cent would be divided proportionally. This
would give an estimated vote of 52.9 per cent for A and 47.1 per cent
for B. In spite of the arguments that have been made against the
proportional division of no-opinion respondents, the method has often
worked well in practice.
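
The proportional division in this example is a one-line computation. The
sketch below, with a hypothetical function name, simply rescales the
shares of the respondents who expressed an opinion so that they total
100 per cent.

```python
def divide_no_opinion(shares):
    """Rescale the decided shares so that they sum to 100 per cent,
    which spreads the no-opinion group proportionally over the candidates."""
    decided = sum(shares.values())
    return {name: 100 * value / decided for name, value in shares.items()}

# 45 per cent for A, 40 per cent for B, 15 per cent undecided
estimate = divide_no_opinion({"A": 45, "B": 40})
print({name: round(value, 1) for name, value in estimate.items()})
# prints {'A': 52.9, 'B': 47.1}
```

Dividing 45 by 85 and 40 by 85 reproduces the 52.9 and 47.1 per cent
given in the text.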


Wording of Questions. There are usually a number of different ways to
ask a question, and the responses will often be affected by the wording
which is chosen. The first rule in opinion polling is to use the simplest
terms possible. Otherwise, a large segment of the public will not
understand the question.

Prestige names and concepts should be avoided in questions unless
they are pertinent to the issues. Table 13-6 compares the results which
were obtained from asking a question using Roosevelt’s name with the
same question asked without identifying the issue with Roosevelt (4, p.
39). Emotionally toned words such as Communist, radical, and fascist
will also affect responses.


TABLE 13-6. COMPARISON OF THE RESPONSES TO AN OPINION QUESTION USING
ROOSEVELT'S NAME AND WITHOUT HIS NAME
(Adapted from Cantril, 4, p. 40)

Question, Form A: It has been said recently that in order to keep the
Germans out of North and South America we must prevent them from
capturing islands off the west coast of Africa. Do you think that we
should try to keep the Germans out of the islands off the west
coast of Africa?

Question, Form B: President Roosevelt said recently that in order to keep
the Germans out of North and South America we must prevent them
from capturing the islands off the west coast of Africa. Do you
think that we should try to keep the Germans out of the islands
off the west coast of Africa?

                  Without Roosevelt's name,   With Roosevelt's name,
                  per cent                    per cent
Yes               56                          50
No                24                          21
No opinion        20                          29


Care must be taken not to word a question in such a way that several
different interpretations are possible. Respondents (4, p. 6) were asked
the question: “If the German Army overthrew Hitler and then offered to
stop the war and discuss peace terms with the Allies, would you favor
or oppose accepting the offer of the German Army?” A further inquiry
into how people interpreted the question showed that the meaning varied
for different individuals. Some persons even thought they were
being asked whether or not they were in favor of Hitler being overthrown.

Summary of Principles. The previous sections have discussed only the
main ways in which opinion polls can go wrong (for more
detailed discussions, see 1 and 4). The intention here is not to cast doubt
on opinion polling. Much of the research has been of a high quality,
and the results have proved valid on many occasions. The lesson to be
learned is that opinion polls should be conducted by experts, and that
little faith should be placed in amateur attempts. Although
much is now known about the factors that enter into
opinion polls, there will always be unknown
sources of bias to affect particular results. Only by repeated studies with
different interview techniques, sampling procedures, and wording of
questions, can exact measures of public opinion be obtained.


CHARACTERISTICS OF OPINIONS 


The results of an opinion poll provide only a surface picture of
people's thoughts. Much more needs to be known about opinions: how
they develop, their stability, the intensity with which they are held, and
their relationship to overt behavior. The practical and directly profitable
task of measuring opinions has received the widest attention by polling
agencies. Only during the last twenty years have inquiries been
undertaken into the psychological background of opinions.

Reliability. Opinion questions are open to measurement error diris 
all test items are, If the respondent is asked the same question on n 
occasions, he might give different answers, This may reflect a real AR 
in opinion or only an element of chance in the way in which he psc spei 
If the respondent knows little about the issue or has no firm spacio 
his response might be little more than a mental coin flip. Unreliabi s 
can come from interviewers who interpret responses differently, 0 
dentally record a “yes” when the Tespondent says “no,” or mix up M 
answers to different questions. Statistica] errors in recording and an? 
lyzing the data can add additiona] unreliability, ly 

Because each question in an opinion poll is usually analyzed esee 
rather than added with others as is done on most tests, the reliabi Ey 
of the results is dependent on the separate questions. Only a small amon 
of research has been done on the reliability of opinion questions. T x 
available data indicates fair to good reliability for different ore 
with per cents of agreement often above 80 for repeated interviews wi 
the same questions (see 4, chap. 7). e 

The unreliability in opinion questions tends to reduce the differences
between response categories. That is, if a question is to be answered
either yes or no, whatever unreliability is present will bring the
over-all percentages of yes's and of no's closer to the 50 per cent
mark. If a question were completely unreliable, the expectation is that
exactly 50 per cent of the answers would fall in each of the two
categories. In the same way, any unreliability in opinion questions
will reduce the differences between the responses of any two groups,
such as in a comparison of labor and management opinions. Consequently,
it can be expected that the raw results from an opinion poll are
conservative estimates of the differences between response alternatives
and between the responses by different groups of people.
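
The pull toward the 50 per cent mark can be illustrated with a small
sketch. It rests on a simple mixture model that is an assumption made
here, not something taken from the polling literature: a fraction of
respondents report their true opinion and the rest answer by an even
coin flip. The function name and the group percentages are hypothetical.

```python
def observed_share(true_share, reliable_fraction):
    """Expected share of "yes" answers under a simple mixture model.

    Assumption (for illustration only): a fraction `reliable_fraction`
    of respondents report their true opinion, and the remainder answer
    by an even coin flip.
    """
    for value in (true_share, reliable_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("shares and fractions must lie in [0, 1]")
    return reliable_fraction * true_share + (1.0 - reliable_fraction) * 0.5


# Hypothetical true shares of "yes" answers in two groups.
true_labor, true_management = 0.80, 0.30

for reliable_fraction in (1.0, 0.7, 0.0):
    labor = observed_share(true_labor, reliable_fraction)
    management = observed_share(true_management, reliable_fraction)
    print(f"reliable={reliable_fraction:.1f}  labor={labor:.2f}  "
          f"management={management:.2f}  gap={labor - management:.2f}")
```

Under this model the observed gap between the two groups is the true
gap multiplied by the reliable fraction, so it can only shrink, which
is why the raw poll results are described as conservative estimates.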

Stability. Even though an opinion might be reliably measured at one
point in time, it may change over a short period of time. This seems to
have occurred in the 1948 Presidential contest between Truman and
Dewey. The voting sentiment apparently shifted to Truman in the last
few weeks before the election. Partly because the polling agencies did
not sample opinion during the period immediately before the election,
they mispredicted the outcome and lost considerable public confidence
for several years thereafter.

Large shifts in opinion are more likely to occur on issues which are
being discussed widely in the mass media of communications. Opinions
are also more likely to change when the individual has no first-hand
experience with the issue and must rely on secondary sources of
information. Opinions generally change more easily on issues which do
not affect the individual in an immediate personal way. One critical
incident can often have a marked effect on public opinion, such as the
Japanese attack on Pearl Harbor. Public opinion often changes markedly
when a highly respected person advocates a particular viewpoint.

Knowledge of Issues. Two individuals who give the same response to
an opinion question may do so with very different amounts of knowledge
about the issues. An individual may agree that it would be a good idea
for the government to build and operate a dam on the Snake River but
have no idea where the Snake River is or have any conception of the
consequences of the government action. In many cases, an individual's
opinion may be based on incorrect information rather than lack of
information. Whenever the polling question is at all complicated, it is
useful to give a short information quiz about the issues along with the
usual polling questions. It is sometimes found in this way that the
more informed people have different opinions from the uninformed
people, which leads in turn to suggestions as to how certain segments
of the public can be educated about an issue. If the opinion poll
results are being used in an advisory way to help in forming the policy
of an institution, more faith can usually be placed in the sagacity of
the opinions of people who know something about the issues.

Strength of Conviction. Another dimension of opinions is the strength
with which viewpoints are held. One person may say “yes” to the
question “Should we help rearm Germany?” and feel very strongly about
it, very sure that it is the proper thing to do. Another individual may
also answer “yes” but do so with considerable indecision and lack of
sureness. Respondents are often asked to rate the strength of their
conviction after responding to a polling question. One procedure is to
ask the individual to endorse one of several statements which best
describes the strength of his conviction, such as the following (see 4,
p. 57):

I am not at all strongly convinced on this matter.

I guess it is the best thing to do.

I am absolutely convinced this is what ought to be done.

It is important to know the strength of a respondent's conviction for
several reasons. When the strength of conviction is low, the individual
is more open to outside influence and more likely to change his
opinion. People who hold beliefs strongly usually exert more influence
in social groups and in civic affairs than individuals whose beliefs
are weak. A hard-core minority of people who strongly hold an opinion
can often win over the less certain larger group.

Validity. The validity of opinion polls depends on the way in which the
results are used. Many studies intend only to gauge public sentiment,
to discover what people say about issues. The opinion poll is then
valid if it is free from major sources of bias like those mentioned
earlier in this chapter. Sometimes the results of opinion polls are
used to predict how individuals will behave later. Election polls
intend to predict how individuals will vote, and most market research
studies intend to predict what individuals will buy. In these cases,
the predictive validity of opinion poll results can be measured in
terms of how well they actually forecast future behavior. Experience
has shown that the carefully conducted opinion poll can often predict
future behavior with a high degree of accuracy.

The predictive validity of opinion polls depends very much on the
issues involved. If there are strong social pressures to hold one
opinion or preference over others, the results of an opinion poll may
serve as a poor predictor of how people will actually behave. A case in
point is the story, probably apocryphal, concerning the study of
preferences for new automobiles conducted at the end of the Second
World War. Most people said that they favored utilitarian, economical,
low horsepower, unadorned automobiles. Being suspicious of the results,
the polling agency undertook another poll in which each person was
asked, “What kind of automobile does the average person want?” The
dominant picture obtained in the second poll was of a chrome-plated,
leather-upholstered, high horsepower, many-gadgeted palace on wheels.
The results obtained from the initial study would, as time has shown,
have served poorly in forecasting the type of automobile that people
would actually buy during the following decade. This also illustrates
the need to use a variety of questions concerning an issue rather than
to place complete faith in one question only.


ATTITUDE SCALES 


Attitudes are predispositions to react negatively or positively in some
degree toward an object, institution, or class of persons. What
negative and positive reactions are depends in part on the particular
society. However, it is rather universal to consider certain actions as
negative and positive, or as evidence of likes and dislikes. A negative
attitude toward another person is manifested in wanting to see bodily
harm done to the individual, wanting to be removed from his presence,
wishing that he would suffer a financial loss, wanting him to lose
positions of prominence, and so on. The negativeness and positiveness
of attitudes are intuitive concepts which are difficult to define
concretely.

There are many ways in which a person can react negatively or posi- 
tively, ranging from a refusal to buy merchandise from a Jewish business- 
man to joining the Ku Klux Klan. Verbal reactions are the most widely 
studied attitudinal reactions. People are asked either to agree or disagree 
with statements like, “The government should stop letting Italians enter 
this country” and “We should do everything possible to make Italian 
immigrants feel at home in this country,” or some more complex method
of rating is used. Verbal reactions are studied more than other kinds of
behavior because of the ease with which they can be handled in research
investigations. It is much simpler to have subjects react to a list of
statements in a questionnaire than to observe how individuals behave in
their daily lives. Although interesting new types of data might be found by
studying how people actually behave, the enormous labor and expense
which are entailed have permitted only a few investigations of this kind. 
All of the instruments which will be discussed in this section concern 
what people say about their feelings. 

The object in the measurement of attitudes is to locate each person at
some point on a continuum, or scale, which ranges from “strong negative
attitude” through “neutral” to “strong positive attitude.” On the
surface this seems like a relatively simple problem. Why not simply ask
the individual how he feels about an object or class of persons? This
could be done by having the individual rate, say, Negroes on a
continuum ranging from “dislike very much” to “like very much.” Single
ratings of over-all reactions are sometimes used in attitude studies,
and although they are useful for making gross comparisons of people,
they have a number of drawbacks.

If an individual is asked to give his over-all reaction on a single
continuum, this does not provide a reliable indication of the intensity
of his reaction. The reliability of single-rating scales is usually too
low to provide a good estimate of attitudes. A more reliable location
of the individual on an attitude continuum can be obtained by combining
the scores made on a number of items.

Another reason why devices other than the single rating scale must be
used is that over-all expressions of likes and dislikes are often
dominated by stereotyped reactions which do not represent true
feelings. People will rate that they are either neutral or favorable
toward Negroes, yet they might also endorse more specific statements
like, “I would not like to have my children in an ‘integrated’ school,”
and “Negroes are more likely to commit crimes than are white people.”

An attitude scale is derived from a collection of statements all of
which concern different degrees of positive and negative reaction to
the object or person being studied. First it must be determined whether
or not the statements concern only one attitude. It is possible that
there are two or even several attitudes involved in the items, each of
which should form a separate scale rather than be mingled in one. A
statement such as “This city should help provide housing for migrant
Southern Negroes” might be included in a scale to measure attitudes
toward Negroes. It might prove to be as much or more an attitude toward
the use of civic funds to compete with private industry.

After obtaining a group of items which concern only one attitude, a
method of administering and scoring the items must be determined so
that people will be located as precisely as possible along the
underlying continuum, or scale. These are the principal steps that are
involved in attitude scaling. Some of the major techniques for
obtaining attitude scales will be discussed in the following sections.

Thurstone Method of Scale Construction. One of the earliest and still
most widely used methods of generating attitude scales was developed
by Thurstone (see 19) and his colleagues. The outstanding feature of
the method is the use of judges to determine the points on the attitude
continuum. The first step in constructing the scale is to gather
several hundred statements which seem to express various degrees of
negative and positive attitudes toward the class of persons or objects
being studied. One way of obtaining statements is to have a group of
people write about their attitudes toward the particular person or
object. Other statements can be obtained from popular literature on the
issue. The experimenters can usually add numerous statements from their
own experiences.

Each statement is reproduced on a separate slip of paper. Several
hundred persons are then chosen as judges. Each judge is handed the
entire collection of statements and asked to rate them on an 11-point
continuum ranging from “extremely favorable,” through “neutral,” to
“extremely unfavorable.” It is important to note that the judges do not
rate the extent to which they agree with each of the statements; they
judge the intensity of each statement.

The next step in constructing a scale of the Thurstone type is to
compute indices of variability for each item. If all of the judges
place a statement in the eighth, ninth, and tenth piles, this
represents good agreement about the intensity of the item. If the
placements are scattered all up and down the 11 piles, this indicates
that the statement is either ambiguous or perhaps belongs to some other
attitudinal factor. The standard deviation of the placements of an item
by the judges would serve as an adequate measure of variability. The
index which is actually used is one-half the distance between the 25th
and 75th percentile of the judges' ratings. Only those statements are
retained for the attitude scale that have relatively low interjudge
variability.

The final step in constructing scales of the Thurstone type is to select
from the remaining statements a group that spreads evenly over the
intensity range. For this purpose the median intensity score is
determined for each item. If half the judges place a statement at or
above the seventh pile and half place it at or below the seventh pile,
the median intensity score is 7. As is usually the case, the median
lies at some point between two piles, and consequently medians are
expressed to one decimal place, like 7.4 and 9.3. (The arithmetic mean
would serve much the same purpose here as the median.) The scale
position for each item is then considered to be the median intensity
judgment. The final scale consists of the twenty or so items which
spread most evenly over the intensity range. Ideally, the items should
have median intensity judgments respectively of 0, 0.5, 1.0, 1.5, and
so on to the top of the scale.
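
The two item statistics just described, the median placement and the
index of one-half the distance between the 25th and 75th percentiles,
can be sketched as follows. This is only an illustration under stated
assumptions: the function names, the cutoff `max_q`, and the judges'
placements are hypothetical, and the percentiles are computed with the
"inclusive" interpolation rule of Python's `statistics.quantiles`
rather than the grouped-data interpolation a hand computation would
use.

```python
from statistics import median, quantiles


def item_statistics(placements):
    """Return (scale value, Q) for one statement.

    placements: pile numbers (1 to 11) assigned by the judges.
    Scale value is the median placement; Q is one-half the distance
    between the 25th and 75th percentiles of the placements.
    """
    q1, _, q3 = quantiles(placements, n=4, method="inclusive")
    return median(placements), (q3 - q1) / 2


def retain_items(judgments, max_q):
    """Keep statements whose interjudge variability Q is at most max_q.

    max_q is a hypothetical cutoff chosen by the investigator.
    """
    kept = {}
    for statement, placements in judgments.items():
        scale_value, q = item_statistics(placements)
        if q <= max_q:
            kept[statement] = scale_value
    return kept


# Hypothetical placements by seven judges (real studies use far more).
judgments = {
    "statement A": [8, 8, 9, 9, 9, 10, 10],   # judges agree closely
    "statement B": [1, 3, 5, 6, 8, 10, 11],   # scattered: ambiguous item
}

print(retain_items(judgments, max_q=1.0))
```

From the statements that survive, the investigator would still have to
pick the twenty or so whose scale values spread most evenly over the
range; that selection step is not shown here.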

The final attitude scale is composed by randomly ordering the
statements on a printed form. The subject is instructed to mark those
statements with which he agrees. The score is the median intensity of
the statements which are marked. If one person agrees with statements
that have intensity indices of 9.5, 10.0, and 10.5, his score would be
10. In a perfect scale the individual would mark only one statement or
several statements with similar intensity scores. However, in practical
work with the scales, subjects vary somewhat from this expectation. It
would not be uncommon to find a person who agrees with statements that
have intensities of 10, 9, 8.5, and then agrees with one of 5 also.
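
The scoring rule can be written in a few lines. The sketch below uses
the two sets of endorsed scale values mentioned above; the function
name is hypothetical.

```python
from statistics import median


def thurstone_score(endorsed_scale_values):
    """Score a respondent as the median scale value of the statements
    he marked as agreeing with."""
    if not endorsed_scale_values:
        raise ValueError("respondent endorsed no statements")
    return median(endorsed_scale_values)


# The consistent respondent described in the text.
print(thurstone_score([9.5, 10.0, 10.5]))

# The less consistent respondent, who also agrees with a statement at 5.
print(thurstone_score([10, 9, 8.5, 5]))
```

The first call reproduces the score of 10 given in the text. In the
second, the stray endorsement at 5 pulls the median down only slightly,
which is one practical reason for scoring with the median.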

The scale of the Thurstone type is an economical, common-sense type
of instrument which has been used widely during the last twenty-five
years. Special scales have been constructed for attitudes toward war,
Negroes, censorship, patriotism, capital punishment, Chinese, and
numerous others. Some of the items on the Scale for Measuring Attitudes
toward the Church (19) are shown in Table 13-7.

The use of judges to estimate the intensity of attitudes is both a
strong and weak point of the Thurstone scale. It is an advantage in
that the scale positions have a rational meaning that would be
difficult to obtain by any other method. If the judges are in good
agreement that a respondent who marks a particular statement has a
positive or negative attitude to a certain degree, this tells us
something directly about the respondent. The use of judges is
particularly helpful in locating the neutral point on the attitude
scale. Attitudes are inherently bipolar, ranging in opposite directions
from a hypothetical neutral point. In campaigns to change particular
attitudes, it is very important to locate individuals in respect to the
neutral point. It is much easier to bring a positive change in an
individual whose attitude is slightly positive rather than slightly
negative. Sometimes it is more important to know precisely which side
of the neutral point the individual is on rather than to know the exact
distance from the neutral point in either direction.

TABLE 13-7. ILLUSTRATIVE ITEMS AND SCALE VALUES FOR A THURSTONE TYPE
MEASURE OF ATTITUDES TOWARD THE CHURCH

(Adapted from Thurstone and Chave, 19, pp. 60-63)

Scale value   Item
0.2           I believe the church is the greatest institution in
              America today.
1.0           I believe that the church has grown up with the primary
              purpose of perpetuating the spirit and teachings of Jesus
              and deserves loyal support.
2.2           I like to go to church, for I get something worthwhile to
              think about and it keeps my mind filled with right
              thoughts.
3.1           I do not understand the dogmas or creeds of the church
              but I find that the church helps me to be more honest and
              creditable.
4.0           When I go to church I enjoy a fine ritual service with
              good music.
5.1           I like the ceremonies of my church but do not miss them
              much when I stay away.
6.1           I feel the need for religion but do not find what I want
              in any one church.
--            I believe that churches are too much divided by factions
              and denominations to be a strong force for righteousness.
8.2           The paternal and benevolent attitude of the church is
              quite distasteful to me.
--            My experience is that the church is hopelessly out of
              date.
--            I think the organized church is an enemy of science and
              truth.


The Thurstone method of scaling assumes that intensity judgments are
independent of the judges' own attitudes. It is conceivable that this
is not the case, that, say, two individuals who have different
attitudes toward Negroes would not agree on the intensity of statements
about Negroes. If judgments of intensity are not independent of
attitudes, the scale positions are very much a function of the judges
who are chosen. This would change the scale positions and would shift
the neutral point as well. What would be scored as a positive attitude
on a scale obtained from one group of judges might be scored as a
negative attitude on a scale obtained from a different group of judges.

A number of studies (6, 8, 9) report comparisons of scale values
obtained from different kinds of judges. Hinckley (8) employed a group
of Southern white students who were prejudiced against Negroes to judge
the intensity of 114 statements. The scale positions obtained were
compared with those for a group of Northern white students who were
favorable toward Negroes. The two scales correlate .98, showing a very
close agreement in the two relative orderings of the items. This and
other studies show that at least the relative position of statements on
the scale can often be determined without considering the attitudes of
the judges themselves. However, the studies show that there are often
significant differences in the absolute intensity of statements when
judged by groups that hold different attitudes. (It must be remembered
that the correlation coefficient is a measure of covariation, not a
measure of agreement.) A prejudiced group and a nonprejudiced group may
agree that one statement is more positive than another, but they may
differ in their judgments of how positive the statements are. When the
scale values are dependent in part on the particular group of judges
which is used, the neutral point on the scale cannot be determined
precisely.
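
The distinction between covariation and agreement can be made concrete
with a small sketch. The scale values below are invented for
illustration and are not taken from Hinckley's study; the function name
is likewise hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scale values for the same five statements as judged by
# two groups of judges holding different attitudes.
group_one = [1.0, 3.0, 5.0, 7.0, 9.0]
group_two = [2.0, 4.5, 6.0, 8.5, 10.0]

r = pearson_r(group_one, group_two)
mean_shift = sum(b - a for a, b in zip(group_one, group_two)) / len(group_one)

print(f"correlation = {r:.3f}")
print(f"average difference in scale value = {mean_shift:.2f}")
```

In this invented example the two groups order the statements
identically and the correlation is very high, yet every statement sits
a pile or more higher for the second group, so the neutral point would
fall at a different place on the two scales.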

A potentially weak point in scales of the Thurstone type is the absence
of any direct procedure for determining whether or not there is only one
attitude involved in the statements. Statements that relate prominently
to attitudes other than the one being studied would probably have high
interjudge variability and would be removed from the scale for that
reason. However, this is not a precise method for purifying the scale.
As is sometimes done, a scale of the Thurstone type can be purified of
extraneous factors by retaining only those items that show a
substantial degree of homogeneity with the scale as a whole (see 19 for
the procedure which is used).

Likert Method of Scale Construction. The Likert method (see 11) also
starts with the collection of a large number of positive and negative
statements about an object, institution, or class of persons. Unlike
the Thurstone approach, judges are not employed in the Likert method.
Instead, the scale is derived by item-analysis techniques. The
collection of items is administered to a group of subjects. Each item
is rated on a five-point continuum ranging from “strongly approve” to
“strongly disapprove.” Then each item is correlated with total score,
which shows the extent to which the item measures the same general
underlying attitude as the total set of items. Items which have low
correlations with total score are either unreliable or measure some
extraneous attitude factor. Only those items which have high
correlations with total score are retained for the attitude scale. The
Likert scaling procedure helps ensure that the final scale concerns
only one general attitude and that individuals can be located with at
least moderate precision at different points on the scale.
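
The item-analysis step can be sketched as follows. The response data,
the function names, and the cutoff of .50 are all hypothetical, chosen
only to show the mechanics; as in the text, each item is correlated
with a total score that includes the item itself.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def item_total_correlations(responses):
    """responses: one row per subject, one 1-to-5 score per item."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson_r([row[i] for row in responses], totals)
            for i in range(n_items)]


# Hypothetical scores of six subjects on three items.
responses = [
    [5, 5, 3],
    [4, 5, 1],
    [4, 4, 4],
    [2, 2, 2],
    [1, 2, 5],
    [2, 1, 3],
]

correlations = item_total_correlations(responses)
retained = [i for i, r in enumerate(correlations) if r >= 0.50]

for i, r in enumerate(correlations):
    print(f"item {i}: r with total = {r:.2f}")
print("retained items:", retained)
```

In this invented data the first two items rise and fall with the total
while the third does not, so only the first two would be kept.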

Comparing the Likert and Thurstone methods, the Likert approach is
more empirical because it deals directly with respondents' scores rather
than employing judges. The Likert method more directly determines
whether or not only one attitude is involved in the original collection
of items, and the scale which is derived measures the most general
attitudinal factor which is present. The use of a five-point scale for
each item provides more information than the simple dichotomy of
“agree” or “disagree.” The only place in which the Thurstone method
might be superior is in the direct meaningfulness of scale scores. No
absolute meaning can be given to the points on the Likert scale. The
scores are relative to the group from which the scale is constructed.
It would be helpful to gather more general norms, to at least be able
to say that an individual is above or below average by so much, but
large-scale norms are seldom obtained. However, as was pointed out
above, the meaning of scores on the Thurstone scale is sometimes
dependent on the judges who are used.
On the final scale of the Likert type, the subject marks each statement
in one of the categories of strongly agree (SA), agree (A), undecided
(U), disagree (D), and strongly disagree (SD). If the subject marks
“strongly agree” on a positive statement, a score of 5 is given,
“agree” is given a score of 4, and so on to a score of 1 for “strongly
disagree.” The scoring system is reversed for negative items. A
“strongly agree” is given a score of 1 and so on to 5 for a “strongly
disagree” response. The individual's final score is obtained by summing
the item scores. Illustrative items from one of the scales are shown in
Table 13-8.
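
The scoring rule, including the reversal for negative items, can be
sketched as follows. The two statements are taken from Table 13-8;
treating the first as a negative item and the second as a positive one
is an assumption made here for illustration, and the function name is
hypothetical.

```python
# Score given to each response category on a positive statement.
POSITIVE_KEY = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}


def likert_score(responses, negative_items):
    """Sum the item scores, reversing the key for negative items.

    responses: mapping of statement -> marked category (SA, A, U, D, SD).
    negative_items: set of statements that are scored in reverse.
    """
    total = 0
    for statement, category in responses.items():
        score = POSITIVE_KEY[category]
        if statement in negative_items:
            score = 6 - score   # SA becomes 1, ..., SD becomes 5
        total += score
    return total


# Assumed polarity: the first statement is negative, the second positive.
negative_items = {"The future looks very black."}

responses = {
    "The future looks very black.": "SD",
    "Times are getting better.": "SA",
}

print(likert_score(responses, negative_items))
```

A subject who strongly rejects the negative statement and strongly
accepts the positive one receives the maximum of 5 on each, so the
reversal keeps high totals pointing in the same direction for all
items.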


TABLE 13-8. ILLUSTRATIVE ITEMS FROM A LIKERT TYPE SCALE FOR THE
MEASUREMENT OF CIVILIAN MORALE DURING THE DEPRESSION

(Adapted from Rundquist and Sletto, 12)

The future looks very black.                        SA  A  U  D  SD
Times are getting better.                           SA  A  U  D  SD
No one cares much what happens to you.              SA  A  U  D  SD
Real friends are as easy to find as ever.           SA  A  U  D  SD
It is difficult to think clearly these days.        SA  A  U  D  SD
There is really no point in living.                 SA  A  U  D  SD
It is great to be living in these exciting times.   SA  A  U  D  SD
Most people can be trusted.                         SA  A  U  D  SD


The Bogardus Social-distance Scale. As early as 1925, Bogardus
developed a technique for measuring attitudes toward different national
groups. Rather than being a scaling procedure, such as the Thurstone
and Likert methods, the Bogardus instrument is identified by a novel
type of item. The items concern the desired social distance from a
particular national group. The following categories of social distance
are usually employed in the scales:

1. To close kinship by marriage
2. To my club as personal chums
3. To my street as neighbors
4. To employment in my occupation
5. To citizenship in my country
6. As visitors only in my country
7. Would exclude from my country

The scale can be used for any national group simply by placing the name,
such as English or Russians, at the head of the social-distance
categories. The respondent marks those categories which suit his
feelings about the particular national group.

No formal procedures are used to ensure that the Bogardus categories
constitute a scale in terms of the requirements given earlier. However,
respondents use the categories as though they are points on a scale.
Very few inconsistent responses are found, e.g., a willingness to
accept a Russian to close kinship by marriage but an unwillingness to
have a Russian as a neighbor.

Very little research has been done to determine correlations between
responses on the social-distance scale and those on more conventional
attitude scales such as the Thurstone and Likert instruments. Social
distance is both an interesting and a complex concept. A respondent may
express a desire for social distance from a national group such as
Norwegians not because he dislikes the class of people but because they
are unfamiliar. A person might dislike the English personally but be
willing to accept them in close relationships because of the general
prestige that the English have with most of the people in this country.

The social-distance scale, as measured by categories like those above,
fails to measure extreme negative attitudes. There are various degrees
of dislike beyond wanting to have no social contact with a group.
Investigators have often found it necessary to employ several
conventional attitude items along with the social-distance categories
to tap extreme negative feelings. This is illustrated in the following
scale developed by Crespi (5) for measuring attitudes toward
“conscientious objectors”:

1. I would treat a conscientious objector no differently than I would
any other person, even so far as having him become a close relative in
marriage.

2. I would accept conscientious objectors only in so far as having
them for friends.

3. I would accept conscientious objectors only so far as having them
for speaking acquaintances.

4. I don't want anything to do with conscientious objectors.

5. I feel that conscientious objectors should be imprisoned.

6. I feel that conscientious objectors should be shot as traitors.

It was necessary to add statements 5 and 6 to get at strong negative
feelings.

The Bogardus scale is much easier to construct than the Thurstone or
the Likert instruments. It applies only to classes of people: it could
not be used, for example, to measure attitudes toward labor unions. The
same categories or a slight variant of them can be applied to whatever
group is being studied. One of the interesting properties of the
social-distance scale is that it concerns actions rather than more
general feelings about people. Although the research evidence on this
point is very small, the Bogardus social-distance scale might perform
better than conventional attitude measures in predicting how two groups
of people will behave when brought together.

The Guttman Method of Scale Construction. An interesting approach to
attitude scaling is the procedure developed by Guttman (see 7, 14) in
connection with studies of the morale of the American soldier during
the Second World War. The objective of the method is to test directly
whether a collection of items can be scaled on one attitude continuum.
The criterion of scalability is that if an individual endorses a more
extreme item he should endorse all less extreme items also. The scaling
criterion is applied to the scores obtained from a tryout group of
respondents. If a single scale underlies all of the items, the items
can be arranged to form a triangular pattern like that in Figure 13-1.
Figure 13-1 greatly simplifies the Guttman scale in order to illustrate
the principle that is at work. Usually there are many more people being
studied. Also, the figure is arranged in such a manner that only one
person has each of the different patterns of response. Item 1 is the
most extreme attitude, let us say extremely negative in this case. Only
person A agrees with item 1, and as is true in a perfect scale,
person A agrees with the second most intense statement, 2, the third
most intense, 3, and so on to the least intense, but still negative,
item 9. Person B holds the second most intensely negative attitude, and
so he endorses item 2 and all of those beneath it in the intensity
hierarchy. Person I holds the least intensely negative attitude,
agreeing to only the least intensely negative item, 9.

FIGURE 13-1. Pattern for nine persons (A to I) on nine items (1 to 9)
for a perfect Guttman scale. A mark of 1 means that the person agrees
with the statement, and the other mark means that he does not agree
with the statement. (Figure not reproduced.)
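
The triangular pattern and the scalability criterion can be sketched as
follows. Agreement is coded 1 as in the figure; coding non-agreement as
0 is an assumption made here, as are the function names. The check
presumes that the items are already ordered from most to least extreme.

```python
def perfect_pattern(n):
    """Build the triangular pattern for n persons on n items.

    Row k is the person holding the (k+1)th most extreme attitude; he
    agrees (1) with item k and every less extreme item after it.
    Non-agreement is coded 0 (an assumption of this sketch).
    """
    return [[1 if item >= person else 0 for item in range(n)]
            for person in range(n)]


def is_perfect_scale(matrix):
    """True if every person who endorses an item also endorses all less
    extreme items. Columns must run from most to least extreme."""
    for row in matrix:
        if 1 in row:
            first_agreement = row.index(1)
            if any(mark != 1 for mark in row[first_agreement:]):
                return False
    return True


pattern = perfect_pattern(9)
for person, row in zip("ABCDEFGHI", pattern):
    print(person, " ".join(str(mark) for mark in row))

print(is_perfect_scale(pattern))

# A respondent who endorses the most extreme item but rejects a milder
# one violates the criterion.
print(is_perfect_scale([[1, 0, 1]]))
```

Person A's row is all 1's and person I's row has a single 1 in the last
column, matching the description above. A single inconsistent row is
enough to make the strict check fail, which anticipates the practical
difficulty with the criterion.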

The response pattern found in the perfect Guttman scale is exactly
what is obtained if people are rank-ordered on a physical continuum.
Suppose, for example, that we ask people questions about their height,
and assume that they know how tall they are. The person who answers
“yes” to the question “Are you above 6½ feet tall?” will answer “yes” to
“Are you above 6 feet tall?” and “yes” to all questions about heights
down to zero. The person who answers “no” to the question “Are you
above 6½ feet tall?” but does answer “yes” to the question “Are you
above 6 feet tall?” will answer “yes” to questions about all lower
heights. Similarly with persons of all heights, by knowing the most
extreme statement that a person endorses, his other responses can be
predicted perfectly.

In the example above, we started by knowing that only one scale,
height, was involved. The purpose of the Guttman procedure is to test
whether or not a collection of attitude statements will exhibit the
characteristic pattern. It is an elegant notion; and if a set of items
can be found which will fit this pattern, it is convincing evidence of
scalability.

Unfortunately there are some important practical drawbacks to the use
of the Guttman scaling procedure. The principal one is that almost no
collection of statements will meet the strict lawfulness of the
scalability criterion. The criterion insists that each separate item be
almost completely reliable; but in practical work individual items are
notoriously heavy with measurement error. Suggestions have been made
for the use of approximately scalable collections of items, but even an
approximation of scalability in the Guttman sense is rare to find. One
solution to the problem is to search around through the items to find a
set which will approximately meet the criterion of scalability. This
takes considerable advantage of chance, and it will be expected that in
many cases the scaling will not hold up when the items are administered
to new groups of subjects.

In the few cases in which the Guttman criterion of scalability has been 
approximately met, the items prove to be so closely related in content as 
to constitute near rewordings of the same statement (see 14). There is 
little reason to believe that collections of statements such as are handled 
by the Likert procedure will scale in the Guttman sense. 

Although the procedures of scaling identified with Guttman are not
suitable for most practical work, the concept of a unidimensional scale
is appealing and deserves more attention. The important thing to note
about this concept of scalability is that it is defined entirely in
terms of rank-ordering of persons and items. The data with which the
method begins is the set of qualitative responses (categories), either
“agree” or “disagree” on each item, given by respondents. The example
concerning the heights of people illustrates how qualitative responses
can define a rank-ordering of people on an underlying scale. Most of
the procedures for deriving and validating attitude measurement devices
assume that interval scales underlie the responses. This is so because
they employ means, standard deviations, correlations, and other
statistics which make sense only when an interval scale is assumed for
the measures being studied.

Fewer assumptions about psychological data would need to be made if
statistical methods that assumed only rank-order scales could be used.
The great difficulty is that when the interval scale is abandoned in
favor of rank-order and categories, most of the powerful statistical
methods such as multiple correlation and factor analysis must be
abandoned also. This is squarely the problem that is encountered in
using the Guttman criterion of scalability. When the criterion is not
met perfectly, and it is highly unlikely that it ever will be, there is
little that can be done statistically to determine the number and kinds
of scales which underlie the responses. What is needed is a method for
analyzing the responses to attitudinal statements which will determine
the number and kinds of rank-order scales which are present in the
data. No solutions to this problem have yet been developed. Until they
are, the concept of scalability and the procedures it entails will not
have much practical importance in the construction of attitude scales.


Factored Scales. One of the most direct approaches to determining the
number and kinds of scales involved in a collection of items is to factor
analyze responses. All of the conventional methods of factor analysis can
be used on attitude items in the same way that they are applied to other
sets of psychological measures. Desirably, the items should be rated on a
continuum, using, say, five or more points, rather than given a simple
“agree” or “disagree” response. After the collection of items is
administered to a group of respondents, preferably with at least several
hundred persons included, all intercorrelations among the items are
computed. The intercorrelations are then factor analyzed. Each of the
major factors constitutes a separate attitude scale. The items which relate
most prominently to each factor are used to construct the scales.

The use of factor analysis to derive attitude scales
assumes that interval scales underlie the responses. This is an assumption
that is quite generally made in statistical work with psychological
measures. The tests and attitude scales which have been derived in this way
work well in practice. Factor analysis would have been used more broadly 
in the derivation of attitude scales were it not for the statistical labors that 
are involved. Now that high-speed computational equipment is becoming 
more available, large-scale factor analysis of attitude items will probably 
become more common. 
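
The first steps of the factoring procedure described above can be sketched in a few lines. The sketch computes all intercorrelations among a set of items and then extracts loadings on the largest principal axis by power iteration. This is a simplified stand-in for a full factor analysis (no communality estimates, no rotation, one factor only), and the ratings and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two lists of ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(items):
    """All intercorrelations among the items (each item is a list of ratings)."""
    return [[pearson(a, b) for b in items] for a in items]


def first_factor_loadings(corr, iterations=200):
    """Loadings on the largest principal axis, found by power iteration."""
    k = len(corr)
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)
    )
    return [x * math.sqrt(eigenvalue) for x in v]


# Hypothetical five-point ratings by eight respondents on four items.
# Items a, b, and c were invented to hang together; item d stands apart.
a = [1, 1, 2, 2, 4, 4, 5, 5]
b = [1, 2, 2, 3, 4, 4, 5, 5]
c = [2, 1, 2, 2, 4, 5, 4, 5]
d = [1, 5, 1, 5, 1, 5, 1, 5]

loadings = first_factor_loadings(correlation_matrix([a, b, c, d]))
print([round(x, 2) for x in loadings])
```

Items with high loadings on the same factor would be grouped into one scale; an item with a low loading, like the fourth one here, would be set aside or assigned to another factor. In real work several hundred respondents, not eight, would be needed.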

Methods of attitude scaling like Thurstone’s and Likert’s have been 
successful without factor analysis largely because they have dealt with 
collections of statements which are dominated by one large attitudinal 
factor. Although both methods tend to eliminate items that belong to 
small extraneous factors, neither method will work well if applied to 
collections of items in which several prominent factors are present. Factor 
analysis is more necessary in deriving scales in rather subtle attitude 
domains, like attitudes of employees toward working conditions in a 
particular factory and attitudes of people toward government spending. 
In terms of employee attitudes, it is likely that a factor would be found 
of over-all negative and positive attitude toward the work; but it would 
be expected also to find separable factors relating to different aspects of 
the work, such as danger of working conditions, adequacy of tools and 
equipment, and pleasantness of surroundings. 

The Interview. A not uncommon criticism of attitude scales is that they 
restrict the individual’s reactions to a relatively narrow range of content 
and prevent the individual from expressing his own attitudes in his own 
words. If the attitude items have all been constructed in the narrow con- 
fines of the psychological laboratory without recourse to what people 
actually think and feel, it is possible that few pertinent statements will 
enter the final instrument.

Some people prefer an open interview situation to the use of attitude 
scales. The respondent is asked a series of “open end” questions like, 
“What do you think about school ‘integration’?” The respondent is then 
allowed to say whatever comes to mind, perhaps with some prodding if 
he gets too far off the track. The responses are either written down 
verbatim or pertinent parts of the responses are noted. 

The advantage of the interview is that it permits a direct inquiry into 
attitudes without assuming in advance what specific questions and alter- 
native answers should be used. The responses will not only teach the 
investigator something about the issues which loom most largely in the 
public mind but will also provide some suggestions as to why people 
feel as they do.

There are several difficulties in using the interview to study attitudes.
In the face-to-face interview situation there is usually more pressure to
give socially acceptable responses, particularly those which the
respondent thinks will please the interviewer, than is the case in the
anonymous response to an attitude scale. In many studies, the interviewer
must exercise his own judgment to determine how the respondent feels
about an issue or why the respondent answers as he does. The judgment
which this entails introduces unreliability into the results of the study. The
major difficulty with the open interview is that it is very hard to find any
systematic way of scoring and summarizing the answers that are given.
The results of the study are often highly dependent on the experimenter's
intuition as to what the dominant trends in the data are.

A procedure which is used in many investigations is to conduct
interviews preparatory to developing more standardized measures of
attitudes. Fifty to a hundred persons varying in age and socioeconomic
status are interviewed about their attitudes. This provides a rich source of
attitudinal statements as well as suggestions about the methods of
measurement which will be most effective in subsequent studies.


CHARACTERISTICS OF ATTITUDES 

Reliability. Well-constructed attitude scales are usually as reliable as
most aptitude tests are. As is true for all kinds of measures, the reliability
is directly dependent on the number of items in the scale and the size
of correlation among the items. When short scales are employed,
containing no more than five or so items, the scores will often not be
sufficiently reliable to make predictions about individual respondents.
However, even a short and relatively unreliable scale will serve to measure
the attitudes of whole groups of persons, such as to differentiate the
attitudes of employees and management about labor unions.
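
The dependence of reliability on the number of items and on their intercorrelation can be made concrete with the generalized Spearman-Brown formula, reliability = k·r / (1 + (k − 1)·r), where k is the number of items and r is the average correlation among them. The sketch below is only an illustration; the function name and the value r = .20 are assumptions, not figures from any particular attitude scale.

```python
def scale_reliability(n_items, mean_r):
    """Generalized Spearman-Brown estimate of the reliability of a sum score.

    n_items: number of items in the scale
    mean_r:  average correlation among the items (assumed value)
    """
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


# Same average inter-item correlation, scales of different length.
for k in (5, 20, 40):
    print(k, round(scale_reliability(k, 0.20), 2))
# 5 0.56
# 20 0.83
# 40 0.91
```

With an average inter-item correlation of .20, a five-item scale falls well short of the reliability usually wanted for predictions about individuals, while a forty-item scale reaches a respectable level.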

Validity. Earlier in this chapter it was said that nearly all of the
measures of attitudes depend on verbal reactions rather than on observations
of behavior. If an attitude scale is to be considered an adequate measure of
expressed feelings, it should have content representativeness. Efforts to
achieve content representativeness are made in most of the scaling
procedures by starting with a broad collection of the things that people say
about an issue. It was recommended to use some open interviews before
constructing the scale, to help ensure content representativeness for the
items.


Many of the currently available attitude scales do a sufficiently good
job of sampling verbal reactions to be considered valid measures of
expressed attitudes. The problem in validating attitude scales is that
scores are often given a wider significance.

If the scores on attitude scales are used to predict how people will
behave in their daily lives, there is no substitute for empirical studies to
determine predictive validity. There are vast practical difficulties involved
in doing this. It is difficult to decide what specific predictions should
follow from attitude scales. Three employees who show the same amount 
of negative attitude toward a company may do very different things in 
respect to their expressed attitudes. One might make a speech in a union 
meeting about grievances toward the company, another might do shoddy 
work on purpose, and the third employee might do nothing except com- 
plain to his wife about the unfairness of the company. 

Some confidence in the use of an attitude scale is obtained if it can be 
shown to relate prominently to at least some other kinds of relevant 
behavior. For example, Telford (18) compared the average scores on a 
scale measuring attitude toward the church with reports of church
attendance (a low score indicates positive attitude):


Frequency of church attendance    Average score on attitude scale
Regularly                         1.91
Frequently                        2.48
Occasionally
Seldom
Never


Sims and Patrick (13) applied a scale of the Thurstone type for measur- 
ing attitudes toward Negroes to different groups of college students, with 
the following results (a high score indicates a positive attitude): 


                                        Average score
Northern students                       6.7
Northern students living in the South   5.9
Southern students                       5.0


Depth of Attitudes. Psychoanalysis and other forms of depth psychology 
have taught us that what the individual says about his attitudes may be 
quite different from attitudes which he expresses in other ways. The indi- 
vidual may either consciously cover up socially unacceptable attitudes, or 
he may even fool himself and be unaware that there is a conflict between 
what he says and what he does. Some investigators place little faith in 
attitude scales, claiming that they measure only superficial attitudes and 
not “real” attitudes. The difficulty is that there are no accepted procedures 
for measuring “real” or “deep” attitudes. Projective devices have some- 
times been employed for this purpose. A typical procedure for studying 
prejudice is to show the respondent a picture of a Negro talking to a
white person. The respondent is asked to guess what they are talking
about. If the respondent says that the white man is demanding
money which the Negro stole, this could, without too much
strain on the imagination, be considered evidence of a negative attitude
on the part of the respondent. Procedures of this kind are interesting and
might eventually lead to new tools for the measurement of attitudes.
However, the projective devices which are presently available depend
heavily on the subjective judgment of the interviewer, and there are
numerous questions about the validity of the instruments. They are more
suited to laboratory studies of small groups of people than they are to
general use with diverse segments of the general population. (Projective
techniques will be discussed in detail in Chapter 15.)

It should not be assumed that expressed attitudes are less important
than other kinds of negative and positive reactions. What people say
publicly is often more influential in determining the course of events
than whatever they may “really” feel. Expressed attitudes are often
derived from conflicts within the individual. A person may have learned
in his early years that all foreigners are to be distrusted. Later experience
and education could instill in him an attitude of
concern for people of all nations. The two attitudes sit side by side in
the individual, both of them genuine feelings. If the individual publicly
espouses the doctrine that we should do everything possible to help the
downtrodden throughout the world, this does not necessarily mean that
the expressed attitude is false or unimportant.

It is necessary for clinical psychologists and psychiatrists to delve
within the individual to learn the masking of one attitude with another
and the conflicting feelings which lead to personal disturbance. However,
for the sake of measuring public attitudes, it is usually more important to
learn the feelings that people will express and stand on publicly. These
both reflect and determine the public image, the common viewpoint that
a group or a nation holds as a basis for action.

The Meaning of Attitudes to the Individual. Two people may show
much the same response on an attitude scale and differ markedly in
other ways. One variable to consider is the relative abstractness versus the
concreteness of the attitudinal reaction. Some attitudes are abstract in the
sense that they are formed with only a small amount of contact with the
object, person, or institution concerned. Other attitudes are based on
considerable first-hand experience. As an example of an abstract attitude,
up to ten or so years ago Turks were given markedly unfavorable ratings
on attitude scales. This was so in spite of the fact that few of the
respondents had any first-hand experience with Turks. This was probably a
historical holdover from the days when the then warlike Turks were in
conflict with the Christian world. It is not uncommon to find small
children voicing the prejudices of their parents toward race and
religion. Such attitudes are largely second hand, based on little concrete
experience.

Sometimes only very slight cues are used for the formation of attitudes.
For example, rank your preferences for the following groups of people, 
rating the most preferred group 1 and the least preferred 4: 


Sylvanians     ____
Alusians       ____
Blugdugians    ____
Polurasians    ____


The majority of the people will rate the Blugdugians as the least
preferred group; this is so in spite of the fact that these are all imaginary
groups of people. The Blugdugians are rated low because of the unmel- 
odious sound of their name. 

There is a real danger that some of the attitude studies measure only 
incidental abstract aspects of the issues. This is particularly likely in 
measuring attitudes toward institutions and associations, such as General 
Motors, the Americans for Democratic Action, and Harvard University. 
Something can be learned about the concreteness of attitudes by employ- 
ing a standard list of questions concerning the respondents’ knowledge 
and experience with the issue. Although there is no proof of it, the expec- 
tation is that highly abstract attitudes change more readily than ones 
based on considerable concrete experience. 

Another important characteristic of an attitude is the strength with 
which it is held. Some people feel more strongly about an issue than other 
people do. A common practice in the study of attitudes is to employ a 
strength of feeling question along with each attitude statement. That is,
after the respondent registers his relative agreement or disagreement with
a statement like, “All public schools should be ‘integrated’ as soon as
possible,” he is asked to rate the strength of his feeling in one of the
categories “very strongly,” “fairly strongly,” or “don’t care much one way or
the other.” Strength of feeling corresponds closely with the negative or
positive intensity of the attitude. People who express either strong
negative or strong positive attitudes usually say that they feel strongly
about the issue. People who hold nearly neutral attitudes tend to say they “don’t
care much one way or the other.” Although it is conceivable that a person 
could feel strongly about a neutral attitude, this is seldom found. There 
is some evidence to indicate that intensely negative attitudes are held 
more strongly than intensely positive attitudes. 


INTEREST INVENTORIES 


Interests were defined earlier in the chapter as stated preferences for 
activities. As the word “stated” emphasizes, interest inventories depend
on the individual's honest and accurate reporting of what he likes to do.
At the outset some justification needs to be given for using interest
inventories. The common-sense approach to learning about interests would
be simply to ask the individual what occupations he prefers. If the
individual already knows that he wants to be a physician, sea captain, or
fireman, it would be a waste of time to have him record his preferences on
a printed form. The purpose in administering tests is to gain some new
information about people.


There is a considerable amount of evidence to show that stated
preferences for occupations are unrealistic. This is particularly so among
adolescents and young adults, with whom interest inventories are most
often used. Young people are usually quite unaware of the specific
activities which are entailed in different occupations. The individual's
stated preferences for occupations are often prompted by glamorized
stereotypes. The physician is remembered as the heroic figure who
performs the operation while the gallery looks down in silent awe.
The sea captain is seen holding steadfast to the helm against the stormy
onslaught of the sea. The mental picture of the fireman has him
descending the ladder with the rescued maiden on his shoulder. All of
these images are of course unrealistic. Few physicians do surgery at all.
They must spend many hours in unheroic activities such as reading
medical texts, writing reports, and calming the fears of anxious patients.
The sea captain has scant opportunity to steer the ship because of the
modern electronic gadgetry which steers it automatically; [...] such
matters as bookkeeping, personnel management [...]

The approach of the interest inventory is to ask the individual about
his preferences for a wide range of relatively specific activities such as
mending a clock, preparing written reports, and talking to groups of
people. From these a diagnosis is made of the occupations which most
closely match the interests of the individual. A fundamental assumption
in the use of interest inventories is that people in different occupations
have at least partially different interests. Otherwise there would be no way
in which interest tests could be used successfully to advise people to
consider one occupation rather than another.

The Strong Interest Inventory. One of the earliest and still most widely
used measures of interests is the Vocational Interest Blank (VIB)
developed by E. K. Strong (17). Separate forms are available for men and
women. The VIB employs 400 questions about relatively specific activities.
On most of the items the subject indicates his preferences by marking one
of the three categories “like,” “indifferent,” “dislike.” Some illustrative
items from the men’s form are as follows: 


Buying merchandise for a store    L  I  D
Adjusting a carburetor            L  I  D
Interviewing men for a job        L  I  D


Responses to the VIB can be scored in terms of 45 occupations on the 
men's forms and 25 occupations on the women's forms. A separate scoring 
key is available for each occupation. The scoring keys were developed 
from the responses to the VIB made by successful persons in each of the 
occupational groups. Each scoring key is composed in such a way as to
differentiate the people in a particular profession from people in general. 
This procedure for developing scoring keys is referred to as criterion key- 
ing. Each scoring key consists of a set of weights to be applied to the item 
responses. The weights range from +4 to —4. A positive weight means 
that people in the profession, in accounting, say, mark “like” to the item
more frequently than do people in general. A negative weight means that 
accountants mark the “like” category less frequently than people in gen- 
eral. The larger the difference between the profession and people in 
general, the larger is the weight. If an item does not differentiate a pro- 
fession from people in general, it receives a zero weight. A considerable 
amount of research work was required to obtain the occupational keys, 
and scoring keys for new occupations are gradually being developed. 
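
The logic of criterion keying described above can be sketched as follows. The sketch compares the proportion of a profession marking “like” with the proportion of people in general doing so, and converts the difference to a weight between +4 and −4. The conversion rule (ten times the difference, rounded and clipped), the counts, and the function names are all hypothetical; the actual VIB weighting procedure is more elaborate and is not reproduced here.

```python
def item_weight(prof_likes, prof_n, general_likes, general_n, scale=10, limit=4):
    """Weight for one item: positive if the profession marks "like" more often
    than people in general, negative if less often, zero if no difference.

    `scale` and `limit` are illustrative assumptions, not Strong's constants.
    """
    diff = prof_likes / prof_n - general_likes / general_n
    weight = round(scale * diff)
    return max(-limit, min(limit, weight))


def score(responses, key):
    """Sum the key weights of every item the respondent marked "L" (like)."""
    return sum(key[item] for item, mark in responses.items() if mark == "L")


# Hypothetical counts of "like" responses: 200 accountants, 1000 men in general.
counts = {
    "Buying merchandise for a store": (140, 300),
    "Adjusting a carburetor": (40, 500),
    "Interviewing men for a job": (100, 500),
}
accountant_key = {
    item: item_weight(p, 200, g, 1000) for item, (p, g) in counts.items()
}
print(accountant_key)
# {'Buying merchandise for a store': 4, 'Adjusting a carburetor': -3,
#  'Interviewing men for a job': 0}

respondent = {
    "Buying merchandise for a store": "L",
    "Adjusting a carburetor": "D",
    "Interviewing men for a job": "I",
}
print(score(respondent, accountant_key))  # 4
```

The third item receives a zero weight because accountants and people in general mark “like” equally often in these invented counts, so it contributes nothing to the accountant score.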

The responses of an individual are scored on either some or all of the 
professions. The scores can be converted to standard scores, percentiles, 
or to a grading system ranging from A to C. The resulting profile of scores 
is used to interpret the individual's interests. People usually express high 
interest in a number of related professions such as mathematician,
engineer, and chemist.

The Kuder Interest Inventory. The Kuder Preference Record (10) and 
the Strong VIB are the two most widely used interest inventories. The two 
inventories present an interesting contrast in procedures of test develop- 
ment. Instead of using the “like,” “indifferent,” "dislike" response cate- 
gories like those on the VIB, the Kuder inventory presents items in triads. 
The subject picks from three activities the one that he likes most and the 
one that he likes least. Two illustrative item triads are as follows: 


Visit an art gallery      ____
Browse in a library       ____
Visit a museum            ____

Collect autographs        ____
Collect coins             ____
Collect butterflies       ____


Instead of scoring the form in terms of numerous separate occupations,
scores are given in 10 general areas: outdoor, mechanical, computational,
scientific, persuasive, artistic, literary, musical, social service, and clerical.
The 10 categories were derived by item-analysis procedures, much the
same as factor analysis would obtain. Unlike the VIB, the 10 interest areas
on the Kuder inventory were not related in any direct way with the
responses of people in specific occupations. The interpretation of an
interest profile from the Kuder inventory is largely dependent on the
judgment of the counselor as to the interests that are involved in different
occupations.

Although the Strong and the Kuder inventories started out on different
tracks, recent developments have brought them closer together. A number
of broad interest areas have been developed for the Strong VIB by
factor-analytic studies. Scores on these can be obtained in addition to those
for specific occupations. Research done since the Kuder inventory was
published shows that the original logical method of analyzing interests leads
to good predictions for certain occupations and poor predictions for others.
Efforts are now being made to obtain equations for predicting how
closely an individual's interests on the Kuder inventory match those of
people in different occupations. However, this work is not far enough
along to provide the same empirical evidence for interpreting responses
as is furnished by the VIB.

Interests and Accomplishment. Because a person is interested in certain
activities, such as those relating to engineering, it does not necessarily
mean that he has the capacity for accomplishment in that field. The
relationship between interests and ability is particularly tenuous in children
and young adolescents. The child who professes an interest in athletic
activities, for example, may have little athletic ability, and similarly for
artistic and scientific pursuits. However, there is an increasing
congruence between interests and ability as the individual matures. It is very
difficult for a person to maintain an interest in activities in which he
constantly performs poorly. As the child matures, his interests gradually shift
toward activities which he can do at least relatively well.

Stability of Interests. Without some stability over time, scores on
interest inventories would be of little use in advising people on vocational
choices. Interests are notoriously unstable in children and adolescents.
They begin to stabilize in the late teens and remain remarkably stable
throughout adulthood. Strong (16) found retest correlations in the .70s
and .80s over intervals as long as twenty-two years. This is both a credit 
to the VIB and strong evidence that interests are relatively enduring 
characteristics of human adults. 
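
A retest correlation of the kind cited above is simply the product-moment correlation between the scores that the same persons obtain on two occasions. The sketch below shows the computation; the scores are invented for illustration and are not Strong's data.

```python
import math


def retest_correlation(first, second):
    """Product-moment correlation between two administrations of a scale."""
    n = len(first)
    m1, m2 = sum(first) / n, sum(second) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    s11 = sum((a - m1) ** 2 for a in first)
    s22 = sum((b - m2) ** 2 for b in second)
    return s12 / math.sqrt(s11 * s22)


# Hypothetical scores of five persons on one occupational key,
# obtained on two occasions some years apart.
first_testing = [10, 20, 30, 40, 50]
second_testing = [12, 18, 33, 41, 46]

print(round(retest_correlation(first_testing, second_testing), 2))  # 0.98
```

A value near 1.00 means that persons keep nearly the same relative standing from one testing to the next, which is what stability of interests amounts to in practice.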

Interest Inventories in Vocational Guidance. Interest inventories are
second only to intelligence tests as aids to vocational guidance. Interests
are, at least theoretically, very important to consider in choosing
occupations. If an individual really likes a particular type of work, he can often
succeed in spite of only a moderate amount of aptitude. No matter how 
much initial aptitude a person has, he can fail in a line of work through 
inattention and lack of effort. 

In vocational guidance, interest inventories are used for two related 
purposes: to predict satisfaction in the work and to predict successful 
performance. The criterion keying on the Strong inventory provides some 
supporting evidence that interest tests can predict future satisfaction on 
the job. Another type of evidence is that follow-up studies of individuals 
who completed the VIB in college show a strong tendency for people to 
enter occupations similar to their expressed interests. Both these pieces of 
evidence also tend to support the hypothesis that interests are predictive, 
at least to some extent, of job performance. Strong (15, 17) has gathered 
more direct evidence to show that interest scores are predictive of per- 
formance in some occupations. For example, there is a relationship 
between the amount of interest shown on the key for insurance agents 
and the amount of insurance which agents sell. 

Even though interests are, at least theoretically, very important to con- 
sider in choosing occupations, it does not necessarily follow that the 
available instruments are maximally effective measures of interests. As is 
true in most areas of testing, a great deal more research with interest 
inventories is needed. 

It is unfortunate that interest tests cannot be used as successfully in 
the selection of people for particular jobs as they can in vocational
guidance. It has been shown repeatedly that people can fake interest tests to
a marked extent. If people are told to mark the Kuder or the Strong 
inventory as a successful engineer or physician would, they will obtain 
profiles similar to the profession in question. 

People usually give honest responses in a vocational guidance situation. 
They are there for information and advice, and there is little to gain by 
faking an interest inventory one way or the other. If, as is usually the 
case, the vocational guidance facility is not connected with personnel- 
selection programs, there is no way in which test scores can lower the 
individual's chances of getting a particular job. When an individual 
applies for a particular job, he is seldom as desirous of learning about 
himself as he is of obtaining the position. If the individual is applying for 
a job as an electrician, he knows that it behooves him to answer "yes" to 
an interest item like, “Do you like to repair electrical motors?" The small 
amount of success that interest inventories meet in personnel-selection 
programs should not mar the important place they have in vocational 
guidance.


CHAPTER 14


Inventories for Self-description


This and the succeeding two chapters are concerned with measures of
personality. There are almost as many definitions of personality as there
are methods to measure personality. The term “personality” is popularly
used to indicate social charm. A person is said to have a lot of personality,
meaning that he possesses all of the social graces.

Personality is sometimes defined negatively as the nonability or
noncognitive functions. Any measurement device that is not clearly either
an aptitude or achievement test is often classified as a personality test. It
is unsatisfactory to define things in this way. For example, it would be of
little help to a Martian to be told only that a dog is something that is
neither a cat nor a cow.

It proves very difficult to give concise definitions of personality, mainly
because there are many kinds of human attributes which could be so
labeled. A typical definition is “personality is the total functioning
individual interacting with his environment.” This definition certainly lacks
little in terms of inclusiveness, but it provides no starting point for the
measurement of personality characteristics. The best starting point for a
discussion of personality measurement is to distinguish some of the
different kinds of personal characteristics which can be studied. In the
preceding chapter the human attributes of opinions, attitudes, and interests
were defined and treated separately from the kinds of attributes which will
be treated here and in the succeeding two chapters.

The following three kinds of attributes are most often discussed under
the term personality:

1. Social traits. The characteristic behavior of an individual in respect
to other people. Typical traits are honesty, gregariousness, shyness,
dominance, talkativeness, and humor. Social traits are often referred to as
the surface layer of personality, the way that the individual appears in
society. Inventories which seek to measure social traits are said to concern
the normal personality.

2. Maladjustment. This covers a broad range of disorders including
hypertension, lack of emotional control, intense fears, disruption of
thought processes, and depression, among others. Although the definition
of maladjustment depends in part on the values in a particular society, a
person is usually said to be maladjusted if his own behavior is severely
disturbing to himself and/or the people around him. When a person
becomes highly distraught or his behavior is odd or dangerous, he is
classified as mentally ill.

3. Dynamics. A difficult to define collection of the underlying
individual processes which produce different states of adjustment and
different kinds of social traits. Dynamics involves such things as the
Oedipus complex, amount of ego strength, body image, and dream
symbolism, none of which can be observed directly but must be complexly
inferred from the individual's behavior.



The instruments which will be discussed in this chapter are primarily
concerned with social traits and maladjustment. Although they relate
to some extent to dynamics, the projective tests to be discussed in the
next chapter are the primary instruments in the realm of dynamics.

A careful distinction should be made between inventories for
self-description and other kinds of paper-and-pencil personality tests. The
instruments which will be discussed in this chapter are all standard
procedures for having the individual say what he is like. Typical items are
“Do you lead the discussion when talking to others?” “Do you like
competition?” “Are you attractive to the other sex?” There are other kinds
of personality inventories which are not concerned with self-description.
An example is a measure of suggestibility, or “acquiescence,” in which
the subject is presented a series of highly ambiguous statements and
asked to indicate whether he agrees or disagrees with each. (Supposedly,
the person who tends to agree in this situation is more “acquiescent.”)
Some inventories which are not directly concerned with self-description
will be discussed in the next two chapters.

Many personality inventories are currently available; hundreds
have seen general use during the last forty years, and others have
been employed for particular research and selection purposes.
Consequently, the inventories which will be discussed in this chapter are
examples chosen only to represent different kinds of testing procedures.


MALADJUSTMENT 


Self-report is one of the basic tools in the medical profession. The
physician asks, "Where does it hurt?" "How is your appetite?" The patient
volunteers information like, "I don't sleep so well" and "I don't have
any pep." The self-report technique carried over into psychiatry and
clinical psychology. The questions are different but the method is the
same. The psychiatrist asks the patient whether he has nightmares or
feels uncomfortable in a crowd; he inquires how the patient gets along
with parents
and friends. Over the years it became apparent that certain questions are 
more successful than others in detecting maladjustment. These have 
become almost standard questions for all interviews, used along with 
questions that have particular relevance to a case. It is an easy jump 
from a list of standard questions to a test: all that is necessary is to write 
the questions down and have the subject indicate his agreement or dis- 
agreement with each. This is exactly how the first personality inventories 
were developed. 

The first self-description inventory that achieved prominence was 
Woodworth's Personal Data Sheet (19). The inventory was developed 
as a means of weeding out emotionally unstable persons from the United 
States Army. A standard, time-saving procedure was needed because of 
the shortage of trained interviewers. The inventory contains 116 questions 
concerning "neurotic" tendencies, ten of which are as follows: 

Are you troubled with dreams about your work? 

Do you often have the feeling of suffocating? 

Have you ever had fits of dizziness? 

Did you ever have convulsions? 

Did you have a happy childhood? 

Have you ever seen a vision? 

Did you ever have a strong desire to commit suicide? 

Can you stand the sight of blood? 

Are you troubled by the idea that people are watching you on the 
street? 

Does it make you uneasy to sit in a small room with the door shut? 

The questions were obtained from a search of the psychiatric literature 
and from conferences with psychiatrists. A neurotic tendency score is 
obtained for each person by adding the number of “neurotic” responses. 

A small amount of research was undertaken to standardize the Personal 
Data Sheet. Items were eliminated if more than 25 per cent of normal 
individuals gave the “neurotic” response. Comparisons were made be- 
tween the responses given by unselected draftees and those given by a 
small group of declared neurotic soldiers. Woodworth recognized that 
the validity of the Personal Data Sheet depends on the diagnostic value 
of the questions which are asked and on the honest reporting of the 
subject. The inventory can easily be faked by anyone who chooses to 
do so. 
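
The scoring and item-screening rules just described can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions: the item labels, the keyed directions, and the response
rates are invented for the example and are not taken from the Personal
Data Sheet. Only the two rules themselves (count the "neurotic"
responses; drop items that more than 25 per cent of normals answer in
the "neurotic" direction) come from the text.

```python
def neurotic_score(responses, keyed):
    """Count the answers that match the 'neurotic' keyed direction."""
    return sum(1 for item, answer in responses.items()
               if keyed.get(item) == answer)


def retain_items(normal_rates, cutoff=0.25):
    """Keep items that at most `cutoff` of normal persons answer neurotically."""
    return [item for item, rate in normal_rates.items() if rate <= cutoff]


# Hypothetical keyed directions for three illustrative items.
keyed = {"dizziness": "yes", "happy_childhood": "no", "stand_blood": "no"}

# One hypothetical respondent.
person = {"dizziness": "yes", "happy_childhood": "yes", "stand_blood": "no"}
print(neurotic_score(person, keyed))

# Hypothetical proportions of normals giving the 'neurotic' response.
normal_rates = {"dizziness": 0.10, "happy_childhood": 0.08, "stand_blood": 0.30}
print(retain_items(normal_rates))
```

With these made-up figures the third item would be dropped, because
nearly a third of the normal group gives the keyed response.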

The Personal Data Sheet was not considered to be a test in the strict 
sense of the word. Persons who gave more than 30 or 40 "neurotic" 
responses were brought in for detailed psychiatric interviews. Although 
little direct evidence for validity was obtained, people who worked with 
the Personal Data Sheet during the First World War were generally
satisfied with the inventory as an aid to psychiatric screening. After
the First World War an interest developed in the construction of tests
of all kinds, personality inventories included. Most of the inventories
were modeled directly after the Personal Data Sheet, to the extent of
using many of the same items.

Bell Adjustment Inventory. A recognized weakness of the Personal Data
Sheet is that it provides only one over-all index of maladjustment,
providing no information about the kind of maladjustment. The Bell
inventory (2) seeks to measure adjustment in the four areas of home
adjustment, health adjustment, social adjustment, and emotional
adjustment, plus a score for total adjustment. The inventory was not
only one of the first instruments to measure differential adjustment, it
was one of the first to be constructed in terms of item-analysis
statistics. Bell had originally hypothesized 10 adjustment areas. Items
were either selected from previous inventories or composed especially
for each of the 10 areas. The items were administered to groups of
students. Items were retained in each adjustment area only if they
correlated with the total score in that area. Only 4 of the 10
hypothesized areas hung together well enough to be included in the final
form. Ensuring that items in each adjustment area are homogeneous does
not mean that they are necessarily valid for any purpose, but it does
give some assurance that the items tend to measure the same underlying
trait.

The Bell Adjustment Inventory has been used widely during the last
twenty years as an aid in the counseling of high school and college
students. The inventory is administered routinely in many schools to
freshman students. Those who have low scores in one or several of the
adjustment areas are brought in for interviews to see if counseling and
other help are required.


Minnesota Multiphasic Personality Inventory (MMPI). The MMPI (13, 14)
represents the apex of research and detailed test construction in the
area of adjustment inventories. Research on the instrument has been
going on for twenty years now, and hundreds of journal articles have been
devoted to its construction, refinement, and use. The MMPI is designed to
measure the relative presence or absence of the nine forms of mental
illness listed below. Two related items are shown for each type of mental
illness. A plus sign means that persons who have the illness are likely
to agree with the item; a negative sign means that they are likely to
disagree.

Hypochondriasis (Hs). Overconcern with body functions and imagined
illness.
Related items:
I do not tire quickly. (-)
The top of my head sometimes feels tender. (+)

Depression (D). This is used in the conventional sense to imply strong
feelings of "blueness," despondence, and worthlessness.
Related items:
I am easily awakened by noise. (+)
Everything is turning out just as the prophets of the Bible said it
would. (+)

Hysteria (Hy). The development of physical disorders such as blindness,
paralysis, and vomiting as an escape from emotional problems.
Related items:
I am likely not to speak to people until they speak to me. (+)
I get mad easily and then get over it soon.

Psychopathic deviate (Pd). An individual who lacks conscience, who
has little regard for the feelings of others, and who gets into trouble
frequently.
Related items:
My family does not like the work I have chosen.
What others think of me does not bother me. (+)

Paranoia (Pa). Extreme suspiciousness to the point of imagining
elaborate plots.
Related items:
I am sure I am being talked about. (+)
Someone has control over my mind. (+)

Psychasthenia (Pt). Strong fears and compulsions.
Related items:
I easily become impatient with people.
I wish I could be as happy as others seem to be.

Schizophrenia (Sc). Bizarre thoughts and actions, out of communication
with the world.
Related items:
I have never been in love with anyone. (+)
I loved my mother. (-)

Hypomania (Ma). Inability to concentrate on one thing for more than a
moment.
Related items:
I don't blame anyone for trying to grab everything he can get in this
world. (+)
When I get bored I like to stir up some excitement. (+)

Masculinity-femininity (Mf). The relative balance of male versus
female interests.
Related items:
I like movie love scenes. (F)
I used to keep a diary. (F)

There are several noteworthy features of the MMPI which set it above
most inventories which are used to detect maladjustment. Face validity
was not a concern in the construction of the instrument as it is with
other inventories. The scales used to measure the nine kinds of mental
illness were developed on an empirical basis by criterion keying. A
large group of items was administered initially to several hundred
normal persons and to groups of mental hospital patients whose symptoms
matched one of the nine kinds of mental illness. Item analyses were
undertaken to find the scoring key for each illness scale which would
differentiate the patient type from normals and from other types of
patients.


The criterion keying procedure used on the MMPI is much like that used
to develop interest scales. Criterion keying is advantageous in that it
promotes the predictive validity of an instrument, and it places little
dependence on the insight and frankness of the subject.

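
The logic of criterion keying can be made concrete with a small Python
sketch. It is an illustration only, not the procedure actually followed
in building the MMPI: the item labels, the tiny groups, and the
minimum-difference rule of .20 are all assumptions chosen for the
example. An item enters the key when the proportion of patients who
agree with it differs enough from the proportion of normals who agree,
and the sign records the direction of the difference.

```python
def endorsement_rate(group, item):
    """Proportion of persons in `group` who agree with `item` (1 = agree)."""
    return sum(person[item] for person in group) / len(group)


def build_key(patients, normals, items, min_diff=0.20):
    """Key every item whose agreement rate separates patients from normals.

    `min_diff` is a hypothetical threshold, not a published value.
    """
    key = {}
    for item in items:
        diff = endorsement_rate(patients, item) - endorsement_rate(normals, item)
        if abs(diff) >= min_diff:
            key[item] = "+" if diff > 0 else "-"
    return key


def scale_score(person, key):
    """Number of keyed items answered in the patient direction."""
    return sum(1 for item, sign in key.items()
               if (person[item] == 1) == (sign == "+"))


# Invented responses: 1 = agree, 0 = disagree.
patients = [{"a": 1, "b": 0, "c": 1}, {"a": 1, "b": 0, "c": 0},
            {"a": 1, "b": 1, "c": 1}, {"a": 0, "b": 0, "c": 1}]
normals = [{"a": 0, "b": 1, "c": 1}, {"a": 0, "b": 1, "c": 1},
           {"a": 1, "b": 1, "c": 0}, {"a": 0, "b": 0, "c": 1}]

key = build_key(patients, normals, ["a", "b", "c"])
print(key)
print(scale_score(patients[0], key), scale_score(normals[0], key))
```

Item "c" is left out of the key because the two groups agree with it
equally often, however relevant its wording might appear. That is the
sense in which face validity plays no part in the method.
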
In addition to the nine mental illness scales available on the MMPI,
four so-called "validity scores" are used. These provide some information
about the test-taking attitude of the subject and the relative honesty
with which his responses are made. The four "validity scores" are as
follows:


The Question Score (?). This consists of the number of items marked
"cannot say." The interpretation is that if a person has a high question
score, the scale scores for the different kinds of illness appear lower
than they should be. If a person has as many as 130 "cannot say"
responses, the individual's test record is assumed to be invalid.

The Lie Score (L). This scale consists of 15 items concerning socially
desirable actions which few people could truthfully endorse. The
assumption is that an individual who endorses numerous items of this
kind is falsifying the inventory.

The Validity Score (F). This consists of 64 items which are endorsed
infrequently by normal subjects. They concern a hodgepodge of symptoms
which are not likely to occur in any one mental illness. The
interpretation is that a person who endorses a number of these items is
careless or fails to understand the items.

The Correction Score (K). This consists of 30 items which were found to
differentiate patients whose scale scores appeared normal from persons
who were actually normal. The responses to these items can be used to
correct the illness scores for particular persons.

It should not be assumed that "validity scores" like those on the MMPI
are a panacea in the use of personality inventories. There is not enough
evidence yet to support the contention that the validity scores actually
measure what they are purported to measure. The question, lie, and
"validity" scores may indicate that a respondent is giving misleading
responses, but nothing can be done except to throw away the test record.
The correction score (K) offers a more positive approach, but not enough
research has been done to say how well it works.

The results of the MMPI can be plotted as a profile showing scores on
the nine illness scales and those on the four "validity" scales (see
Figure 14-1 for illustrative profiles). It is seldom found that a person
scores high (the maladjusted direction) on only one of the scales.
Typically, the maladjustment spreads across several of the scales. Some
of the scales correlate substantially with the others, which is to be
expected because mental illness seldom occurs as one specific pattern of
traits. It is usually the case that the patient has a mixture of
different kinds of mental illness.

Figure 14-1. MMPI profiles for a normal adult and for a "typical
psychotic." (Adapted from Gough, 11, pp. 554 and 563; reproduced by
permission of The Ronald Press Company.)

In order to interpret the MMPI profile, complex pattern scoring meth- 
ods have been devised (14). Even with these it is necessary to have an 
experienced clinical psychologist to interpret the results. As is true of 
many clinical methods, a complex lore has developed about the meaning 
of different kinds of MMPI profiles, much of which has only slight 
grounding in empirical fact. 

Much stands or falls on the eventual validity of the MMPI. Never be- 
fore (and perhaps never again) has so much careful research gone into 
the empirical derivation of a self-description inventory to measure differ- 
ent kinds of maladjustment. The success of this venture will have a strong 
effect on the future course of personality measurement. The present 
evidence indicates that the MMPI is better than any comparable
maladjustment inventory but not nearly as effective as was hoped for
initially.


SOCIAL TRAITS


The difference between measures of maladjustment and measures of social
traits is more a matter of degree than a hard and fast distinction. Many
of the inventories which are primarily concerned with social traits also
contain one or more scales concerning different kinds of maladjustment.
The extremes of most social traits border on maladjustment. For example,
shyness, until it becomes extreme, can be a manifestation of normal
behavior.


Allport Ascendance-Submission Reaction Study. One of the earliest
self-description inventories and one that is still very much with us is
the ascendance-submission (A-S) inventory developed by the Allports.
Its purpose is to measure the tendency of a person either to lead and
dominate his friends and associates or to be led and dominated by them.
Each item describes a situation in which the respondent can show either
dominance or submission to some degree. An illustrative item is as
follows:

Someone tries to push ahead of you in line. You have been waiting
for some time, and can't wait much longer. Suppose the intruder is
the same sex as yourself, do you usually:

Remonstrate with the intruder
Call the attention of the man at the ticket window
"Look daggers" at the intruder or make clearly audible comments
Decide not to wait and go away
Do nothing

Scoring weights for the A-S inventory items were determined in such
a way as to differentiate persons who were rated high in dominance, by
themselves and by associates, from those who were rated low in
dominance. The A-S inventory has been used extensively during the last
twenty years, and the research findings show that, on occasion, it can
serve as a useful and valid measure.


Bernreuter Personality Inventory. Bernreuter's inventory (3) first
appeared in 1939 and has been in wide use since that time. Its purpose
was to combine in one inventory the information obtained from a number
of other inventories. Bernreuter took items from current tests of that
day and composed scoring keys for "neuroticism," "self-sufficiency,"
"introversion," and "dominance." The only evidence of validity offered
by Bernreuter is that the scales of his inventory correlate highly with
the parent inventories from which the items were drawn.

One of the difficulties exemplified by the Bernreuter Personality
Inventory is that of obtaining independent measures of different traits.
If, as on the Bernreuter inventory, there are four scales, it is expected
that the four scores concern at least partially different traits. This is
far from the case on the Bernreuter and on many other personality
inventories. Introversion and neuroticism scores on the Bernreuter
correlate .95, meaning that they are almost totally identical.
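
A one-line calculation shows why a correlation of this size leaves
almost nothing for the second scale to add. The squared correlation is
the proportion of variance two measures share; the function name below
is merely illustrative.

```python
def shared_variance(r):
    """Proportion of variance two measures have in common."""
    return r ** 2


# Correlation reported between the introversion and neuroticism scales.
print(round(shared_variance(0.95), 4))
```

Roughly nine-tenths of the variance is common to the two scales, and
part of the remainder must be measurement error rather than any real
difference between the traits.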

Beginning in the 1930s and continuing to this time, factor analysis and 
similar procedures have been used extensively to construct differential 
personality inventories. Flanagan's (9) factor analysis of the Bernreuter 
scales set the stage for much such work to follow. He found that two 
factors account for nearly all of the variance in the four scales which 
Bernreuter employs. He called these two factors “self-confidence” and 
“sociability.” Bernreuter added scoring keys for Flanagan’s two factors, 
which are used in addition to the other four scales. Six names do not 
constitute six traits. When the inventory is used, only the two scales 
developed by Flanagan should be scored. 

Multifactor Batteries. An extensive effort has been under way to chart
the major factors arising from self-description inventories. The hope is
to find a limited number of traits that account for the scores obtained
from diverse inventories. A series of studies by Guilford and his
associates laid the groundwork in this field. He first collected 35
statements which were purported by different authorities to be primary
aspects of introversion-extraversion. These were made into an inventory
and administered to groups of subjects. A factor analysis of the items
produced five factors instead of the single continuum along which
introversion-extraversion is commonly judged. Successive factor analyses
were performed on new sets of items until 13 factors were found in all.
The 10 most prominent factors appear in the Guilford-Zimmerman
Temperament Survey (12). The factor names and descriptions are as
follows:


G, General Activity. High energy, quickness of action, liking for
speed, and efficiency

R, Restraint. Deliberate, serious-minded, persistent

A, Ascendance. Leadership, initiative, persuasiveness

S, Sociability. Having many friends, and liking social activities

E, Emotional Stability. Composure, cheerfulness, evenness of
moods

O, Objectivity. Freedom from suspiciousness, from hypersensitivity,
and from getting into trouble

F, Friendliness. Respect for others, acceptance of domination,
toleration of hostility

T, Thoughtfulness. Reflective, meditative, observing of self and
others

P, Personal relations. Tolerance of people, faith in social
institutions, freedom from faultfinding and from self-pity

M, Masculinity. Interest in masculine activities, hard-boiled, not
easily disgusted, versus (for femininity) romantic and emotionally
expressive

Two other widely used multifactor inventories are The Sixteen
Personality Factor Questionnaire (5) developed by Cattell and the
Thurstone Temperament Schedule (18).

Special Areas. In terms of numbers, most of the personality inventories
are constructed for particular research or selection problems rather
than for commercial distribution. These range from instruments to detect
brain damage to inventories for the selection of sales clerks. There is
a boom in the construction of inventories for industrial selection. In
some organizations, everyone from the janitor to the company president
must make his marks on a self-description inventory. Much of the work
that is done on industrial inventories is not reported on in professional
journals or in other readily available sources. The reason for this is
that such specialized inventories are not of interest to many people,
and there is a desire in many cases to prevent competing commercial
concerns from adopting the same or similar inventories.

A typical inventory developed for a special research problem is that
of Burgess and Cottrell (4) used to measure marital happiness. Some of
the inventory items are as follows (multiple choice alternatives are
supplied for each item):

Do you kiss your husband (wife) every day?
Do you ever wish you had not married?
What things annoy and dissatisfy you most about your marriage?
If you had your life to live over, do you think you would marry
the same person?
Do husband and wife engage in outside interests together?

The inventory was administered to married couples to measure their
marital adjustment. The responses were used as a criterion to determine
the predictive validity of background data as a forecaster of marriage
happiness.


EVALUATION OF SELF-DESCRIPTION INVENTORIES 


The inventories discussed so far in this chapter have been passed over
quickly, with little attention to the reliability, validity, and uses
for particular instruments. The reason is that a detailed discussion of
separate inventories would not be worth the space it would take. It is
not too critical to say that self-description inventories have mainly
produced disappointing results (see 7, 8, 10, 17). The remainder of the
chapter will consider some of the reasons why self-description
inventories meet with difficulty.

Reliability. Self-description inventories are usually less reliable than 
tests of aptitude, achievement, interests, and attitude. There are numer- 
ous exceptions to this rule, but self-description inventories tend to have 
lower reliability than most other psychological measures. Reliability co- 
efficients for the better-established self-description inventories usually 
range between .75 and .85. Well-established aptitude and achievement 
tests usually have reliability coefficients between .85 and .95. The reli- 
ability of self-description inventories is usually not so low as to render 
them unusable, but the amount of measurement error which is often 
present is enough to materially impair the validity. 

As is true in all tests, the reliability of self-description inventories can 
usually be raised by increasing the number of items. Some inventories 
employ large numbers of items in order to make the results reliable. 
Individual items on self-description inventories are usually less reliable 
than aptitude and achievement items. It is not uncommon to find, for 
example, that a 20-item vocabulary or mathematics test has a reliability 
over .85. It is usually necessary to have twice that many items in a
self-description inventory to reach the .85 level of reliability, if that
level can be reached at all.
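
The relation between test length and reliability can be worked through
with the Spearman-Brown formula, which the passage above does not name
but which is the standard expression for the effect of lengthening a
test with comparable items. The sketch below is illustrative and assumes
that the added items are parallel to the existing ones; the function
names are not from the text.

```python
def spearman_brown(r, k):
    """Reliability of a test lengthened k times, given reliability r."""
    return k * r / (1 + (k - 1) * r)


def implied_shorter_reliability(r_long, k):
    """Reliability of a test 1/k as long, found by inverting the formula."""
    return r_long / (k - (k - 1) * r_long)


# If 40 self-description items are needed to reach .85, the same
# kind of items in a 20-item form would be expected to reach about:
r_20 = implied_shorter_reliability(0.85, 2)
print(round(r_20, 2))

# Doubling the 20-item form recovers the .85 figure.
print(round(spearman_brown(r_20, 2), 2))
```

Under these assumptions a 20-item self-description scale would have a
reliability of about .74, compared with the figure of over .85 quoted for
a 20-item vocabulary or mathematics test.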

Validity. One of the major observations to make about self-description 
inventories is that in many cases almost no effort is made to investigate 
validity. Many of the inventories are founded on "face validity." The
inventory items are often so directly related to the trait being measured
(e.g., "Do you often feel depressed?") that it is an easy step to assume
that inventories measure what they seem to measure. Face validity is not
only insufficient evidence of validity, it can actually be a harmful
quality in an inventory. The more transparent the items and the more they
look like the characteristic to be measured, the more they are open to
distortion by the respondent.

Another approach to the determination of validity has been to corre- 
late one self-description inventory with others. This is not necessarily an 
indication of validity. If self-description inventories correlate with one 
another, this may mean only that people are consistent in applying mis- 
conceptions about themselves or that they are consistent in purposely 
distorting their own characteristics. 

Most self-description inventories require an investigation of predictive
validity (the term "prediction" being taken broadly to mean a functional
relation with other forms of behavior either in the past, at present,
or in the future). The great difficulty in validating many
self-description inventories is that there is almost nothing available
to predict. For example, what should the "friendliness" factor on the
Guilford-Zimmerman inventory predict?

The adjustment-maladjustment inventories are in a better position to
establish validity than are the inventories for social traits. A
technique which is often used to validate maladjustment inventories is
to compare the scores of contrasting groups of persons: mental hospital
patients with people outside, psychiatric rejects from the Army with
soldiers in general, persons who suffer from psychosomatic ailments with
people in general. Even if an inventory successfully distinguishes the
members of contrasting groups, it might not be able to make precise
distinctions in the borderline area between adjustment and
maladjustment.

It is sometimes possible to validate social trait inventories with
contrasting groups. For example, a "leadership" inventory can be used to
distinguish persons who achieve positions of leadership from those who
do not. In spite of the incompleteness of the "contrasted groups"
design, it is one of the best methods of validation currently available
for many personality inventories.

Personality inventories can be validated most easily when they are
used to select persons for particular jobs, positions, and courses of
training. If a reliable success criterion is available, it can be used
to test the predictive validity of self-description inventories. For
example, this can be done by correlating scores made on a
self-description inventory by prospective salesmen with the amount of
sales made later.
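
The salesman example amounts to computing a product-moment correlation
between inventory scores and a later criterion. A minimal Python sketch
follows. The five pairs of figures are invented purely to show the
arithmetic and should not be read as representative; validity
coefficients found in practice for self-description inventories are
generally much lower than the value these data happen to give.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical inventory scores at hiring and sales made later.
inventory_scores = [10, 12, 15, 18, 20]
later_sales = [3, 5, 4, 8, 7]

validity = pearson_r(inventory_scores, later_sales)
print(round(validity, 2))
```

The same function serves for any continuous criterion, such as ratings
of job success, provided the criterion itself is reasonably reliable.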

In those cases where the validity of self-description inventories can
be determined either by the method of contrasted groups or by
correlation with a continuous criterion, the inventories usually show
disappointing results. There are isolated cases of self-description
inventories that prove themselves valid to a reasonable extent, but
these are few and far between. In addition to the generally low level of
validity found for self-description inventories, the validity
coefficients tend to change markedly with variations in testing
circumstances. A common practice has been to administer self-description
inventories to people who are already working on a job, and to validate
the inventory by correlating it with ratings of job success. A
substantial correlation found in this way is no guarantee that the same
results will be found when the inventory is used to select people for
jobs. When people are applying for a job, they want to appear suited for
the position, and the inventory items are likely to be answered in such
a way as to make the applicant appear in the best light. People who are
already on the job are not under the same pressure to fake an inventory.

The responses that people give to personality inventories and the
validity of inventories change with subtle changes in the testing
environment. The author* found that in military settings the responses
to self-description inventories are sometimes different when they are
administered by civilians rather than by officers and different when the
respondents are told that the results will become part of their records
than when they are told that the results are to be used strictly for
research purposes. Even if an inventory proves valid in one situation,
it must be revalidated for any other situation in which it is used, even
if the two situations appear quite similar.

Self-description inventories have so far been of no practical use in the
prediction of school success. Although personality must certainly be an
important element in school work, the author knows of no case in which a
self-description inventory has added materially to the predictive
efficiency of conventional aptitude batteries. Self-description
inventories work out better as aids to student counseling and guidance.
The student who rates himself as unhappy, lacking in confidence, and
without friends is likely to be in need of help. Persons who rate
themselves as maladjusted are good prospects for more intensive study.
There is no guarantee that the student who rates himself as happy,
confident, and abundantly possessed with friends is really that way. He
may actually be well adjusted or he may be as sick or sicker than the
person who relates his troubles.

The validity of self-description inventories in industrial selection
varies considerably with the job (see 10). Self-description inventories
have worked best in the selection of salesmen and sales clerks and
somewhat less well for clerical workers and skilled trades; they work
poorly in the selection of supervisors, foremen, and service workers.

Language difficulties. The validity of self-description inventories
depends to a considerable extent on the clarity with which items are
phrased. Consider, for example, a typical item such as "Do you usually
lead the discussion in group situations?" What is meant by "usually":
60 per cent of the time or 85 per cent of the time? Does the word "lead"
mean to talk the most, make the most important points, or have the last
word? Does the phrase "group situation" pertain only to formal
gatherings such as club meetings, or does it include casual discussions
among friends? This may be overdoing the difficulties of communicating
social traits, but it illustrates the need for language clarity in the
phrasing of self-description items.

In addition to the difficulty of wording items clearly, test constructors
must be careful in describing the traits measured by inventories. This
is true both of the inventories derived by factor analysis and those
which are constructed on a "rational" basis. Some of the trait names
appearing on current inventories are esoteric and confusing, such as
"adventurous cyclothymia." A school psychologist who uses
self-description inventories must know the meaning of the traits being
measured before the results can be put to any valid use. The problem is
not confined to self-description inventories. Aptitude factors like
verbal comprehension and perceptual speed might be misunderstood by the
test user, but there is more of a problem in communicating the meaning of
personality factors.

* Unpublished research.

Nunnally and Husek (15) studied the clarity of factor interpretations
in three widely used multifactor inventories. They found that the
interpretations in one battery were considerably less understandable
than those in the other two. Because of the difficulties in
communicating the meaning of personality traits, an attempt should be
made to determine the clarity of factor interpretations before
inventories are distributed commercially.

A serious consequence of the difficulty of naming and describing
personality traits such as those found in the multifactor inventories is
that different inventories which purport to measure the same trait may
have little in common. Correlations are sometimes very low between
different inventories used to measure a trait such as introversion.
Consequently, whether or not a person is said to be introverted depends
on the inventory which is used.

Trait Clusters. Little is learned from knowing responses to particular
self-description items unless the items evolve into more general
clusters. The hope is that there are a relatively small number of
factors which underlie the many specific social traits. Factor analysis
is the tool which is used most often to search for clusters of related
traits. Unfortunately, the factor-analytic work with self-description
items has not borne the useful results which have been found in studies
of aptitudes, interests, and attitudes.


Correlations between self-description items tend to be low. Relatively
unclear factor solutions are often obtained. Different investigators
find different numbers of factors. The factors are often difficult to
name, and different investigators use different names. Although there is
still controversy about how many and what kinds of factors exist in
human abilities, there is enough confirming evidence and agreement among
investigators to talk as we did in Chapter 9 about the factor structure
of human abilities. There is as yet no comparable factor structure which
can be said to underlie self-description inventories.

The difficulty in finding self-description factors is not statistical but is
due to the nature of social traits. Social traits [...] in a
highly individual manner. In order to find [...] it is necessary
for a group of traits to go together in the [...].
If, for example, there is a general trait of dominance, [...]
dominant with women should be dominant with men, dominant at work
and dominant at home, dominant when placed in one situation as well 
as in any other. However, people are often dominant and submissive in 
highly individual ways. The person who is dominant with women may be 
submissive with men, dominant at home, but submissive at work, domi- 
nant among strangers but submissive with people who know him well. 
The particularity of social trait patterns in individuals is not so marked 
as to spoil all efforts at determining trait clusters, but particularization is
present to a degree that markedly hinders the successful use of self- 
description inventories. 

The Acceptability Influence. More important than the other roadblocks 
to the construction of self-description inventories is the fact that the
respondent can and usually does control the responses for his own ends.
There is a strong drive in all of us to appear socially acceptable and
acceptable to ourselves. Acceptability in our society means being in- 
telligent, courageous, courteous, kind, dominant, and so on. It is extremely 
difficult for an individual to admit to himself that he is ignorant, cowardly, 
rude, mean, and submissive. It is even more difficult for an individual to 
admit these failings publicly. People in general tend to describe them- 
selves in rosy terms, to an extent that lowers the diagnostic value of self- 
description inventories. 

A study by Edwards (6) demonstrates the extent to which the “ac- 
ceptability influence” dominates the responses made to self-description 
inventories. In the first part of his study, 152 subjects rated the social 
acceptability of each of 140 personality trait items on a nine-point scale. 
Scale values for the items were determined, showing the relative nega- 
tive or positive social value attributed to the traits. Next, Edwards made 
the 140 items into a personality inventory. A group of students was 
asked to indicate “yes” for each item that characterized them and “no”
for items that did not characterize them. The proportion of persons
answering “yes” to each item was determined. Edwards found a correlation
of .87 between the judged social acceptability of items and the
proportion of people who endorsed the items. This is strong evidence that
people generally try to describe themselves in a socially acceptable manner
on personality inventories.

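The two quantities in the Edwards study, the proportion of "yes" answers to each item and its correlation with the judged acceptability of the items, can be illustrated with a small sketch. The four items, the scale values, and the answers below are invented for illustration and are not Edwards' data; the function names are likewise hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def endorsement_proportions(answers):
    """Proportion of respondents answering "yes" (True) to each item.

    answers: one list of booleans per respondent, all the same length.
    """
    n_people = len(answers)
    n_items = len(answers[0])
    return [sum(person[i] for person in answers) / n_people
            for i in range(n_items)]


# Hypothetical scale values (nine-point acceptability ratings) for four items.
scale_values = [8.1, 6.5, 2.0, 4.4]

# Hypothetical yes/no answers of four respondents to the same four items.
answers = [
    [True, True, False, True],
    [True, False, False, False],
    [True, True, False, True],
    [True, True, True, False],
]

proportions = endorsement_proportions(answers)
r = pearson_r(scale_values, proportions)
print(proportions, round(r, 2))
```

A high positive value of `r` in such a calculation means that the items judged more acceptable are also the items more people claim for themselves, which is the pattern Edwards reported.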
What is acceptable behavior for particular persons varies somewhat
in terms of age, sex, vocation, and social position. The soldier needs to
be tough and stern, whereas the minister is more in need of the virtues
of kindness and patience. Respondents are usually aware enough of the
characteristics needed in a particular situation to fake in the required
direction.

The [...] of the acceptability influence in self-description inventories
depends on the degree to which the item alternatives differ in social
[...]

[...] seems to meet with some success [...] a military screening
aid is the Shipley Personal Inventory (16). [...] chosen to be of [...]
scoring key [...] psychiatric patients from normal persons. A typical
item is as follows:

I have felt bad more [...] from head colds.
I have felt bad more [...]

The forced-choice te[...] a promising lead in the [...] no means a cure
for all of [...] personality characteristics will rest largely on testing
procedures other than self-description inventories.


CHAPTER 15


Projective Techniques 


[...] an approach to the measurement of [...] different from that of
the self-description inventories. [...] self-description inventories
require [...] to describe himself, the projective techniques require the
[...] objects other than himself. The projective techniques [...] the
hypothesis that an individual's responses to an “unstructured” stimulus
are influenced by his needs, motives, fears, expectations, and concerns.

If there is an agreed-on public meaning for a stimulus, it is referred
to as a structured stimulus. If there is no agreed-on public meaning and
in consequence there is considerable latitude for interpretation,
it is referred to as an unstructured stimulus. A structured stimulus
is compared with an unstructured stimulus in Figure 15-1. First, what do
you see in picture A? Nearly everyone will agree that it is a house. A few
people might call it a school or even a jail, but people will generally
agree that it is a dwelling of some kind. The drawing of a house is a
highly structured stimulus. Now what do you see in picture B? There is no
accepted common meaning for that stimulus pattern. It might be interpreted
as a thunderstorm, a dog, or an artist's palette.

There is a considerable body of evidence to show that interpretations
of relatively unstructured stimuli are related to moods, needs, and
expectations. Studies of the effect of food deprivation show that hungry
subjects more frequently interpret ambiguous drawings as food objects
than do [...] hungry subjects (9). In one study a completely blank [...]
subjects were led to believe that faint images [...]. It was found that
the number of food [...] as the interval of food deprivation [...].
[...] that perception is influenced by values, [...]. An experience
common to [...] of printed material in a way that indicates our [...] of
the moment. For example, the student who worries [...] coming the
next day is likely to see in a hasty glance at the evening paper that “The
police will give the examination tomorrow,” whereas a careful reading
will show that “The police will give the explanation tomorrow.”

If the individual interprets picture B in Figure 15-1 as a thunderstorm 
rather than as a dog, this may indicate something about him personally. 
However, it is one thing to say that a response is significant and quite 
another thing to say just what it indicates about the person. 


Figure 15-1. Comparison of a relatively structured stimulus (A) with a relatively
unstructured stimulus (B).



The instruments which will be discussed in this chapter are primarily
based on the interpretation of personality characteristics from responses
to relatively unstructured stimuli. The ultimate in unstructured stimuli
is found in one of the pictures in the Thematic Apperception Test (TAT).
The subject is asked to make up a story about a completely blank card.
Other stimuli are structured to some extent to obtain information about
particular needs and concerns. For example, in an instrument used to
study attitudes, a picture which shows a white person and a Negro talking
can be used. The stimulus is structured to that extent and in that
manner in order to learn about attitudes toward Negroes.

The term “projective tests,” which is used to refer to the kinds of
instruments to be discussed in this chapter, is somewhat of a misnomer.
The more strict meaning of the term projection is the tendency of an
individual to see his own unwanted traits, ideas, and concerns in other
people. For example, the individual who is himself a rather hostile person
is often prone to see hostility in other people on the basis of little or
no evidence. Here we will be concerned not only with projection in the
strict sense but with the many ways in which the interpretation of
unstructured stimuli can be used to detect personality characteristics.


RORSCHACH TECHNIQUE

By far the most widely used projective technique is the Rorschach.
It was developed during and after the First World War by Hermann
Rorschach, a Swiss psychiatrist. He experimented with [...] ink blots
to find a set which would provide the most insight into the nature of
mental disorder. The 10 ink blots which he settled on are still in use.
An ink blot like those used in the test is shown in Figure 15-2. Five of
the ink blot cards are made in shades of black and gray only. Two of
the remaining cards contain bright patches of red in addition to shades
of gray. The three remaining cards employ various colors.


Figure 15-2. An ink blot of the type used in the Rorschach Test.


Administration. The Rorschach, like the other projective devices,
should be administered only by a highly trained examiner. [...]
will depend very much on the examiner's skill. Although there are [...]
schools of thought as to how the Rorschach should be [...] and
interpreted, a similar approach is used by [...]. The most
widely used approaches are those advocated by Beck (2) and by Klopfer
and Kelley (8).

The [...] followed in [...] Rorschach is to
talk with the respondent a minute or so to gain [...]
his back to the examiner, and then introduce the task with approximately
the following instructions: “People see all sorts of things in these ink
blot pictures; now tell me what you see, what it might be for you, what 
it makes you think of” (8, p. 32). The respondent is allowed to give as 
many interpretations as he likes. If he gives only one response and ap- 
parently tries to give no more, the examiner suggests that he look for 
other things, saying something like, “People usually see more than one 
thing.” A typical series of responses to one card would be for the subject 
to report, “It all looks like a bat” and “This part looks like a vase.” After
looking at the card for a few more seconds, he might give as his final 
response, “This little bit looks like a nose.” 

After the respondent runs out of interpretations for the first card, he 
is handed the next one, and so on for all 10 cards. The examiner is kept 
busy ordering the materials, measuring the time taken to give each re- 
sponse, and writing down verbatim what the respondent says. After 
the series of ink blots is administered, the examiner conducts what is 
called the “inquiry.” This consists of asking the subject questions about 
each of his responses to learn what part of the blot is involved in the 
interpretation and why the particular part gives rise to the response. 

Scoring. There are a number of different kinds of scores given to each 
response. The major scoring categories relate to “content,” “determinants,” 
and “location.” The interpretation of a response depends on the scores 
received from all three categories. Consequently, only very general im- 
plications can be given to scores in one category alone. 

Some of the major categories of content are listed and illustrated as
follows:

Human. A response that would be scored human in content is “a
man doing a dance.”
Human Detail, e.g., “a nose.”
Animal, e.g., “a bat.”
Anatomy, e.g., “bones.”
Clothing, e.g., “a nun's habit.”
Nature, e.g., “northern lights.”
Sex, e.g., “a woman's breast.”
Landscape, e.g., “a lake.”
Food, e.g., “a steak with mushrooms.”

If a person gives almost all animal responses, this is taken as an in- 
dication of low “functioning intelligence.” A predominance of anatomy 
responses is usually interpreted as an indication of depressive tendency
or of overconcern with health. If an individual gives many more responses
in one content category than in others, e.g., many food responses, this
is taken as evidence of an especial personality need. 

Location refers to the part of the ink blot which instigates a particular
response. This information is usually obtained in the inquiry. The
response is scored W if the whole blot is used in the response. Typical W
responses are “a bat,” “a sky full of dark clouds,” and “a [...]
flowers.” The response is scored D if the location is a [...]
perceived subdivision of the total blot. Typical D responses are [...]
large part of an ink blot “a bird's head,” and another part [...]
blot “a statue.” The response is scored Dd if the location is a [...]
usually unnoticed part of the ink blot. A typical Dd response is to
interpret a small patch of gray off from the main figure as “a face.”

The location of responses is thought to indicate the subject's mental
approach to life, whether he deals in sweeping generalizations, considers
details systematically, or gets lost in the compulsive consideration of
unimportant side issues. The number and kind of W responses supposedly
reflect potential intelligence. The intelligent, creative person is
thought to give numerous, well-organized W responses. The person who
emphasizes D responses is thought to have a common-sense, matter-of-fact
approach to life problems. An emphasis on Dd responses is taken as
an indication of compulsive tendency or mental disorganization. The
Rorschach responses of the average adult contain approximately 6 W
responses, 20 D, and 4 Dd.


The scoring of determinants is both more complex and more subjective
than the scoring of content [...]. If the response is determined
by the shape or outline [...] or part of the ink blot, it is
scored as a form response (F). Form responses are further scored either
“plus,” if they are adequate [...], or “minus,” if the [...]
superior people would give [...] interpretations. A response is
scored [...], e.g., “two women [...].” If the response is apparently
determined by the color of the part which is interpreted, it is scored C.
If the response is “sea water,” and in the inquiry the subject gives as his
reason, “because it's green,” it would be scored as a color response. A
response is scored Y if it is determined by the shades of gray within a
blot. A typical Y response is “dark clouds.” It is usually the case that a
response must be scored on more than one determinant. For example, an FC
score means that the response is determined mainly by form and that color
enters as a secondary determinant. A CF response means that color is
the primary determinant and form the secondary determinant.

Form responses are thought to indicate a factual, rational approach
to life. Color responses are thought to indicate emotional reactions. If a
person gives considerably more form than color responses, it is taken as
a sign of emotional impoverishment and repression of feelings. If color
predominates over form, it is said to indicate [...]
of imagination and “inner” life. When movement responses predominate
over color responses, the interpretation is that the individual is intro- 
versive. If color responses are more numerous than movement responses, 
the subject is said to be extratensive, more concerned with the “outer” 
rather than the “inner” life. Shading responses are taken to indicate 
uncertainty, worry, depression, and anxiety. 

After each response has been scored in terms of content, location, and 
determinants, scores are added up over all responses, forming the “re- 
sponse summary.” The total number of responses given and the time taken 
to respond are also considered in the summary. 
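
The bookkeeping behind a response summary can be sketched in a few lines. This is only an illustration of the tallying step, under the assumption that each response has already been scored by a trained examiner; the record format, field names, and the sample record are hypothetical, and the category labels are taken from the examples given above.

```python
from collections import Counter


def response_summary(responses):
    """Tally scored responses by content, location, and determinant.

    responses: list of dicts with the keys "content", "location",
    "determinant", and "seconds" (time taken to give the response).
    """
    return {
        "total": len(responses),
        "total_seconds": sum(r["seconds"] for r in responses),
        "content": Counter(r["content"] for r in responses),
        "location": Counter(r["location"] for r in responses),
        "determinant": Counter(r["determinant"] for r in responses),
    }


# Hypothetical scored record for one subject (not real test data).
record = [
    {"content": "Animal", "location": "W", "determinant": "F", "seconds": 8},
    {"content": "Human Detail", "location": "Dd", "determinant": "F", "seconds": 15},
    {"content": "Nature", "location": "W", "determinant": "Y", "seconds": 11},
    {"content": "Nature", "location": "D", "determinant": "CF", "seconds": 20},
]

summary = response_summary(record)
print(summary["total"], dict(summary["location"]))
```

The tallies are only the starting point; as the next section explains, the interpretation rests on relationships among the categories rather than on any single count.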

Interpretation. Rorschach responses are interpreted in terms of psycho- 
analytic and other “depth” psychologies. The response summary alone is 
only a part of the material used in the interpretation. The trained ex- 
aminer takes note of many complex relationships among content, location, 
and determinants. Thus, the movement response “children playing ball” 
might be interpreted quite differently from “men playing ball” in terms 
of the other responses in the record. A response that would be interpreted 
one way for a man is often interpreted differently for a woman. If a 
person who has never finished high school gives numerous anatomical 
content responses, it might be taken as an indication of morbid thoughts. 
If a college student gives many anatomical responses, it might be inter- 
preted as interest in and familiarity with biology. 

The interpretation of Rorschach responses is an extremely complex 
task. Although there are reasonably clear-cut rules for scoring individual 
responses, there are only general standards and examples to direct the 
final interpretation. The interpretation depends heavily on the subjective 
impression of the examiner. It takes about two years of practice, usually 
working in close collaboration with experienced examiners, to become 
proficient at interpreting Rorschach responses. 


THE THEMATIC APPERCEPTION TEST (TAT) 


The TAT, developed by Murray (12) and his associates, consists of 
pictures of people in various settings. One of the pictures is shown in
Figure 15-3. Some of the pictures are more suited to young rather than 
older people, and some are more suited to men than women. The pic- 
tures are structured in such a way as to elicit responses concerning rela- 
tionships between various social roles and responses relating to different 
emotions. The pictures are unstructured in the sense that a wide variety 
of interpretations can be given about the feelings and actions of the 
persons shown. 

Administration. The examiner has a choice as to how many of the
pictures and what kinds will be used with a subject. If the standard set of
20 pictures is used, it usually requires more than one testing session to
complete the series. In many cases the examiner uses only the half-dozen
or so pictures which he thinks will elicit the most pertinent responses
from a particular subject. The subject is told to make up a story about
each picture in turn. The instructions are approximately as follows: “I am
going to show you some pictures. I want you to tell me a story about what
is going on in each picture. What led up to it, and what will the outcome
be?” The responses are either written down verbatim or a phonograph
recording is made.

Figure 15-3. One of the pictures used in the Thematic Apperception Test.
(Reproduced by permission of Harvard University Press.)

Scoring and Interpretation. Murray originally scored TAT responses
in terms of needs and presses. A need is something that the [...]
is trying to obtain. Some of the needs listed by Murray are [...],
aggression, nurturance, passivity, and sex. Presses are the [...]
influences which help or hinder the attainment of needs, such as aggression,
dominance, and rejection by other persons. Each story can be inspected
for the needs and presses which are portrayed.

Murray's system of scoring also considered the hero in each story and
the outcome. The hero is the person in the story with whom the subject
apparently identifies. For example, responses to Figure 15-3 could
concern stories about either the older woman or the younger woman.
Sometimes the hero is a person outside of the picture. For example, a story
about Figure 15-3 could concern a relative of one of the persons in the 
picture. 

Although needs, presses, hero, and outcomes are considered by most 
of the people who work with the TAT, apparently few examiners make 
detailed scorings of them. In fact, no formal scoring system of any kind is 
used by the majority of TAT examiners. The examiner interprets the 
responses in terms of his knowledge of personality and his experience 
with the instrument. Some interpretations are of a common-sense kind 
with which most examiners would agree. If a male subject imputes 
unfriendly motives and actions to all of the female characters, it strongly 
suggests that he is having troubled relations with women in real-life
situations. If all of the stories end in disappointment, embarrassment, and
failure, it is likely that the subject feels defeated and depressed. If the
stories are lacking in passion and violence, even in pictures where strong
emotion is the evident theme, it indicates that the subject is suppressing
his own emotions. If any one type of social interaction, such as adultery,
is seen in numerous pictures and in pictures where there is little to
suggest it, this indicates that the subject is overly concerned about a
particular issue. Although there is no standard procedure of interpretation,
responses to the TAT pictures provide many hints about the subject's
concerns, his conception of himself, and the way he views his human
environment.

Special Uses for TAT Pictures. The original set of pictures developed
by Murray and his coworkers is only one of many such sets of pictures
that are presently in use. Pictures can be made for many different kinds
of special studies. For example, a special set of pictures could be used to
study the attitudes that management and labor hold toward one another.
Pictures of persons dressed like foremen, workmen, union executives, and
vice-presidents could be made. These persons could be posed in settings
pertinent to the problems under study and structured to the extent that
attitudes about certain kinds of social interactions would be elicited.
Special sets of pictures have been composed for Negroes, Indians, and other
racial and ethnic groups. Pictures of the TAT kind have been used to
study anti-Semitism, family relations, attitudes toward military life, and
many others.

Comparison of Rorschach and TAT. There is considerable competition
between the Rorschach and the TAT as clinical instruments. Some
psychologists prefer the Rorschach in general; other psychologists are equally
biased toward the TAT. Most testers agree that there are some differences
in the kinds of things which the two techniques measure, as well as some
special advantages for each instrument with particular types of subjects.

Although both instruments require an expert examiner, it is probably
more difficult to learn the administration and interpretation of the
Rorschach. Although it is difficult for the subject to “fake” either of the
techniques, it is probably more difficult to distort responses on the
Rorschach.

The Rorschach is not primarily concerned with content. That is, it is
not very important whether a particular part of the blot is referred to as a
snake or a rope. The important thing is whether or not the percept is a
sensible interpretation and whether the percept is determined by the
form or color of the area. Content is the meat of the TAT. If a character
is described as a detective rather than as a salesman, it will markedly
affect the interpretation.


The Rorschach is thought to be valuable in getting at disturbances in
the thinking processes, particularly in patients who are nearing psychotic
breakdowns. The TAT is more concerned with social adjustment than
with thinking processes. Although there is controversy about the point,
there are many people who believe that the Rorschach is better for
detecting and classifying psychotic disorders, and that the TAT is better
with neurotic disorders and milder disturbances. In many cases both tests
are given to a subject, and many clinical psychologists believe this leads
to a better all-around diagnosis than could be obtained from either of the
instruments separately.


OTHER PROJECTIVE TECHNIQUES 


In addition to the Rorschach and the TAT, there are numerous procedures
for evaluating personality which are best thought of as projective
techniques. Almost anything that can be described, completed, or interpreted
serves to some extent as a projective test. We tend to read our
concerns and expectations into everything we do.

Word Association. An old technique for learning about personality is to
have an individual associate words. Various lists of words (7; 13, chap. 2)
have been employed to get at particular kinds of reactions. The usual
practice is to place certain emotionally tinged words among relatively
neutral terms. The following list of words would be useful in studying the
home and school adjustment of adolescents:

 1. Hair        11. Paper
 2. Mother      12. Shoe
 3. Home        13. Fight
 4. Desk        14. String
 5. Book        15. Sister
 6. Father      16. Cake
 7. School      17. Body
 8. Love        18. Me
 9. Tree        19. [...]
10. Hate        20. Friend


The words are read to the subject one at a time. He is asked to give the 
first word that comes into mind. The examiner records the responses and 
notes the time taken to respond. There are several ways in which the 
results are interpreted. The words which are associated often indicate the 
subject's attitudes toward persons and activities. In obvious cases where,
for example, the association to “father” is “spanking” and the association
to “hate” is “father,” there is the strong suggestion of a negative reaction
to the father. Equally revealing as the associated words are the emotional
reactions to the initial terms. If the subject takes a relatively long time to
respond or gives signs of being embarrassed, a strong emotional reaction
is suggested. Thus, even though the subject eventually supplies an innocuous
association for “father,” such as “hat,” the long time taken to respond
would suggest “blocking” and underlying conflict. A third type of information
used in the interpretation is the tendency to give unusual associations.
For example, most persons will associate “table” with “chair” or
respond with some other related item of furniture. It is unusual to find
the response “tiger” to “chair,” and when numerous such associations are
made, it might be related to mental illness.

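Two of the three kinds of information just described, slow responses and unusual associations, lend themselves to a simple screening sketch. The cutoff for a "relatively long time" (here, more than twice the subject's own median time) and the table of common associations are assumptions made for illustration; the text gives no numerical rule, and the function name and data are hypothetical.

```python
from statistics import median


def flag_associations(trials, common, slow_factor=2.0):
    """Flag trials with slow responses or unusual associations.

    trials: list of (stimulus, response, seconds) tuples.
    common: dict mapping a stimulus word to a set of commonly given
            responses (an assumed norm table, supplied by the user).
    slow_factor: a response is "slow" if it takes longer than this
            multiple of the subject's median time (assumed cutoff).
    """
    cutoff = slow_factor * median(seconds for _, _, seconds in trials)
    flagged = []
    for stimulus, response, seconds in trials:
        reasons = []
        if seconds > cutoff:
            reasons.append("slow")
        if stimulus in common and response not in common[stimulus]:
            reasons.append("unusual")
        if reasons:
            flagged.append((stimulus, reasons))
    return flagged


# Hypothetical trials echoing the examples in the text.
trials = [
    ("table", "chair", 1.2),
    ("father", "hat", 6.0),
    ("chair", "tiger", 1.5),
    ("home", "house", 1.0),
]
common = {"table": {"chair", "desk"}, "chair": {"table", "seat"}}

flagged = flag_associations(trials, common)
print(flagged)
```

A flag of this kind only marks a trial for the examiner's attention. Whether a long pause reflects “blocking” remains a matter of interpretation.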
Sentence Completion. A technique which is similar to word association 
is the use of incomplete sentences (see 14). Examples are as follows: 


1. I dislike most to ______
2. I wish that I had never ______
3. Most people are ______
4. I become embarrassed ______
5. The people I like most ______

The usual procedure is to have subjects write their responses,
permitting the testing of a number of persons at one time. [...]
be structured to provide information on different [...]. No
effort is made to “time” the responses. Consequently, the subject is free
to make up whatever responses he chooses. The responses are analyzed
similarly to those on the TAT. That is, the moods, [...], and
expectations portrayed in the responses are interpreted in respect to the
subject's [...].

Play Techniques. A method which is especially useful with children is
to employ play materials as a projective device (see 3; [...]). Children
portray their feelings quite readily in play activities of all kinds, and almost
any set of play materials serves as a “test.” Dolls and puppets are used
most often for this purpose. The situation can be structured by, for
example, naming the dolls “mother,” “father,” “little brother,” “me,” and so on.
Also the environment for the dolls can be structured to some extent by
having present toy implements, such as a baby bottle, a toilet, a bed,
clothing, and others. The child is encouraged to play with the materials
in whatever way he chooses. A revealing set of actions would be for the
child to place the dolls for “mother” and “little brother” with their faces
to the wall while “father” gives the bottle to “me.” Play activity of this
kind is usually very rich in suggestions about the feelings and concerns 
of children. 

Drawings, Paintings, and Sculpture. Artistic productions can be used as 
projective devices. These may be almost completely unstructured like 
finger painting (see 1, chap. 14), or they may be structured to the point 
of asking for the drawing of a man (see 11). The advantage of the
relatively unstructured task is that it is often not perceived as a test. A widely
used procedure is to furnish clay to children and let them make whatever
they like. The product can be analyzed for the symbolism apparent or
simply in terms of the actions imputed to the clay figures. If the child
makes a clay image of himself, it is important to note whether the bodily
parts are in proper proportion. An outsized nose or excessively small arms
might offer suggestions about the child's concept of himself. Children often
manifest strong emotions in artistic productions that they would not talk
about openly. For example, a disturbed child might make a clay figure
of mother, then run over her with the toy car, tear off the arms, and
finally throw her in the wastebasket.

Finger painting is used with both adults and children. The subject is 
presented with a variety of paints and encouraged to use his fingers to
make whatever he likes. The way that the subject approaches the task is
thought to be as important as what is drawn. Some persons are very
reticent about putting their fingers in the paint or making a mess with the
materials. Other subjects take a childlike glee in smearing the paints and
making the biggest mess possible. The colors which the individual chooses
are thought to be important. The proportioning, balance, and integration
of the finished painting are also used in the interpretation. For example,
it might be considered a sign of schizophrenia if the person makes only
swirling smears with one color.


EVALUATION OF PROJECTIVE TECHNIQUES 


Some adherents of projective techniques feel that their instruments lay
open the depths of personality to observation. Rather sweeping
generalizations are often made about the response to an unstructured stimulus.
Before the reader becomes alarmed about his own personality as it might
be mirrored in the projective techniques, a look should be taken at the
factual basis for the instruments.

Reliability. The ordinary measures of reliability are difficult to obtain 
with projective tests. If, as on the Rorschach, component scores are
obtained prior to the interpretation, the reliability of these can be studied 
by split-half, equivalent-form, and other techniques. The reliabilities of 
single components on the Rorschach, like number of W responses and 
number of color responses, are less than those obtained with most tests of 
human ability but not so low as to render the indices unusable (see 4). 
There are two reasons why the reliability of projective techniques cannot 
be determined from component scores. First, most of the projective 
devices employ few, if any, scores for separate responses. Second, even 
if separate responses are scored, the interpretation is the final test result, 
not the initial scores. Consequently, it is necessary to test the reliability of 
interpretations from the test. 
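
Once interpretations are put in numerical form, their reliability can be examined with ordinary correlations. The following is a minimal sketch, assuming that two examiners have each rated the same cases on the same list of traits; the rating scale, the case labels, and the ratings themselves are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def agreement_by_case(cases):
    """Correlate two examiners' trait ratings separately for each case.

    cases: dict mapping a case label to a pair of rating lists,
           (ratings by examiner A, ratings by examiner B).
    """
    return {label: pearson_r(a, b) for label, (a, b) in cases.items()}


# Hypothetical ratings of three cases on five traits (not real data).
cases = {
    "case_1": ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),  # complete agreement
    "case_2": ([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]),  # partial agreement
    "case_3": ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]),  # complete reversal
}

correlations = agreement_by_case(cases)
values = list(correlations.values())
print(min(values), max(values), sum(values) / len(values))
```

Reporting the lowest, highest, and mean correlation over cases shows how much agreement varies from one subject to another. The same calculation applies when one examiner rates the same cases on two occasions.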

Before studies of the reliability of interpretations can be made, it is
necessary to develop objective procedures for recording interpretation.
One way to do this is to have the Rorschach examiner make ratings of the
subject on various personality traits (a handy procedure for doing this
called the Q-sort will be discussed in Chapter 17). Correlations can then
be made between the interpretations given by one examiner on two
occasions and between the interpretations given by different examiners.
Very few studies of this kind have been done.

Beck¹ found correlations between the interpretations by two examiners
on numerous cases to range from above .80 down to zero. One examiner
repeated the ratings of 10 cases a year after the first ratings were made.
He did not remember making the earlier ratings. The correlations between
the two interpretations ranged from .50 to .89, with a mean of .70.
Although these are only a drop in the bucket compared to the studies which
are needed, they indicate that the reliability fluctuates considerably for
different kinds of subjects. Thus, whereas a child's responses might be
interpreted with low reliability, an adult's responses might be interpreted
with high reliability; and the reliability of interpretations made about
mentally ill persons might be higher or lower than interpretations made
about normal persons. The problem is further complicated in that the kinds
of cases with which reliable interpretations are obtained from the Rorschach
may be cases for which only unreliable interpretations are obtained from
the TAT or from other projective instruments.

Whereas it is expected and usually necessary that scores on most tests
remain stable over moderate periods of time, it is not necessarily expected
with the results of projective techniques. If the scores made by adults on
an intelligence test fluctuate markedly over a period of six months or
even several years, the use of the instrument would be seriously questioned.
However, it is to be expected that some of the personality attributes
mirrored in the projective techniques should sometimes change
substantially in relatively short periods of time. For example, the Rorschach
or TAT responses of an individual would likely change from the beginning
to the end of a successful psychotherapy. Instruments like the TAT are
probably affected to some extent by day-to-day changes in moods and
good and bad turns of events. Consequently, it is difficult to untangle the
expected changes in responses from the measurement error inherent in
testing procedures.

¹ Personal communication from S. J. Beck.

Much more research needs to be done on the reliability of interpretations
derived from projective techniques. Equally important to knowing
the over-all reliability with which interpretations are made is the need to
learn the kinds of traits which are more reliably rated and those which
are less reliably rated. For example, it might be found for one projective
technique that interpretations regarding moods are reliable (high
agreement among examiners) and that interpretations regarding future actions
are relatively unreliable. Studies of this kind would help define the kinds
of traits which are most reliably interpreted from different instruments.
It is doubtful that any one test can make the sweeping observations 
about personalities which are often claimed. Different techniques prob- 
ably have different strong and weak points as personality measures. 
Because an instrument is shown to be reliable, it does not necessarily
mean that it is valid for any purpose; but if it can be shown that
interpretations regarding certain kinds of traits are unreliable, it means that
they cannot be valid in any sense. Systematic studies of the reliability with
which different projective techniques measure various
personality characteristics would help determine what
they measure, at least reliably, and would lay a foundation
for subsequent studies of validity.


Validity. Projective techniques must be classified as predictors by 
default. It is difficult to argue that the projective techniques are assess- 
ments of personality. For example, if an individual calls an ink blot a 
butterfly, there is no reason to believe that this response represents any- 
thing about his personality unless evidence is provided to prove that such 
is the case. Consequently, the validity of projective techniques can be 
determined only by correlating interpretations with important behaviors 
outside the testing situation. 

In comparison to the many applications of projective techniques, there
are very few studies of predictive validity. Consequently, no firm
statements can be made concerning the validity of projective techniques as a
group or about the validity of particular instruments. One of the problems
is finding assessments for the projective testers to predict. The
contrasted-groups study has been used most often to validate projective techniques.
In a typical study an examiner is given the Rorschach records of both
normal persons and individuals who are diagnosed as schizophrenic. The
examiner must decide from the test results alone whether each person is
from the normal or the schizophrenic group. Studies of this kind indicate
that the Rorschach and the TAT are moderately successful in
differentiating normal persons from the mentally ill. However, it should be pointed
out that this is a very weak test of validity. The instruments are often
used to make fine distinctions between people in the normal range and to
differentiate among types of mental illness.

One of the difficulties in validating and improving the projective tech- 
niques is an unwillingness on the part of some devotees to subject their 
instruments to empirical investigation. A kind of cultism which encourages 
faith in the instruments rather than a healthy scientific skepticism has 
arisen among some projective testers. They would like other people to 
accept their projective devices and the elaborate interpretations which 
they make as self-evidently valid. It is encouraging to see that many 
exponents of projective techniques are aware of the need for empirical 
investigations and are busily performing the necessary research.

Standardization. A test was defined earlier as a standardized situation
which provides the individual with a score. Do the projective techniques
meet the requirements? In comparison to most tests, the projective
techniques are relatively unstandardized. Although efforts are made to
standardize the presentation of material to the subject, there are inevitable
differences in the approaches used by different examiners. Much
apparently depends on the way the examiner acts and the kind of person he is.
With an instrument like the Rorschach, some examiners typically obtain
more responses than other examiners, and women examiners sometimes
obtain responses different from those obtained by male examiners.

The final results of projective techniques, the descriptions of individual
personalities, are highly dependent on the intuitive judgment of the
examiner. Not only are examiners unable to catalogue all of the rules
which they use in making interpretations, but they are probably not
aware of many cues which they employ. The examiner is not a person
who simply administers the test and as such plays a minor role in the
result. The examiner is part of the projective technique and cannot be
separated from the test materials. Some examiners are undoubtedly more
effective than others in deriving personality descriptions. Consequently,
the validity of the technique is interwoven with the skill of the person who
uses it. As long as projective techniques depend heavily on the intuitions
of examiners, they will not be tests in the strict sense. Projective
testing is more of an art than a science, but this does not necessarily
mean that the results are invalid. It does, however, force us to consider
the validity of separate examiners, and that immensely complicates the
development of projective techniques for general use.

Some efforts have been made to more fully standardize projective
techniques, particularly the Rorschach. Several multiple choice Rorschach
forms have been developed. A list of responses is presented with each ink
blot. The respondent chooses the alternative response which most nearly
matches what he sees in the ink blot. The alternatives for each blot were
chosen from the records of both normal and abnormal subjects. Although
the earlier forms provided only an over-all adjustment score, a more
recent version (6) provides scores for different aspects of personality like
those considered in the usual method of applying the instrument. Most
Rorschach examiners prefer the traditional procedure to the multiple
choice approach. There is not enough evidence to say how well the
multiple choice instruments work in practice; but any systematic efforts
to obtain more standardized procedures should be encouraged.

Special Advantages. One of the foremost advantages of projective
techniques is that most of them are difficult to fake. The subject is usually
unaware of how his responses will be interpreted. The person who tries
to distort responses often gives himself away. It was mentioned earlier in
connection with the word-association technique that an effort to cover up
unpleasant feelings can usually be detected by the relatively long time
taken to respond. Certain areas on Rorschach ink blots usually generate
sexual associations. If a person avoids sexual responses, it might indicate
that he is concerned about sex. There are many other ways in which the
experienced examiner can detect what the subject tries to hide. Even
professional testers find it difficult to distort their own responses to
projective techniques.

An advantage which is shared by most of the projective techniques is
that they can be administered to persons of all ages, ethnic groups, and
intelligence levels. The instructions are very simple and, in most cases,
neither reading nor writing is required. The projective techniques are
particularly applicable to children. Children who are unable or unwilling
to discuss their problems directly usually react to the projective
techniques as though they were games.

Clinical Usefulness. The projective techniques are used primarily in
clinical practice. They are not likely to receive wide use in school
programs or in the selection of people for particular jobs. The examiner must
be an expert, and there are not enough of these experts to fill the needs
of large testing programs. It takes the better part of a day to administer
and interpret responses on the Rorschach and the TAT; thus, it would be
excessively expensive to test large numbers of subjects in this way.

Projective techniques are used in making decisions about people who
seek psychiatric treatment and about mental hospital patients. Questions
must be answered about the nature and severity of the patients’ problems
and the kinds of treatment which are likely to do the most good. These 
are very important questions concerning human lives, and all available 
sources of information should be used. Decisions about patients must 
often be based on no more than one or several interviews plus some back- 
ground information. To the extent that the projective techniques actually 
help in making these decisions, they perform a very worthwhile service. 

Summary Evaluation. The projective techniques are ingenious efforts 
to measure personality variables. Many of the interpretations that arise 
from them appeal to common sense and fit in with psychological theory 
also. Unfortunately, the techniques are relatively unstandardized, and it 
is difficult to determine how well they work. Some projective testers have
made unwisely sweeping claims for particular instruments. The techniques
are often said to measure the “whole personality” and the “total
behavior pattern.” It is doubtful that activities so circumscribed as
responding to 10 ink blots or making up stories about particular pictures
will lead to such broad conclusions about the complexities of human
personalities. The indicated directions for future research are to
standardize the projective techniques and to determine the kinds of
personality attributes which each measures most effectively.


CHAPTER 16


The Future of Personality Measurement 


In the previous two chapters, we discussed the instruments which are
now being used most widely as measures of personality:
self-description inventories and projective techniques. It was concluded that
neither of these approaches to personality measurement has yet provided
instruments of known validity. Whereas we have reasonably good
measures of interests, attitudes, and opinions (which some authors would
classify as personality variables), few, if any, proved techniques are
available to measure social traits, degree of adjustment, and personality
“dynamics.” This leaves a great gap in the storehouse of measurement
tools in an area where measurement methods are most needed.
Personality traits are, of course, very much involved in success and happiness
in vocational and social life. Consequently, there is an urgent need to
develop adequate measures of personality traits. This chapter will discuss
some of the directions in which the research in personality measurement
seems likely to move.

Instead of looking for new approaches to personality measurement, it
is possible that self-description inventories and projective techniques can
be refined to the point of providing adequate instruments. It was said in
Chapter 14 that the results of self-description inventories are probably
most valid when they are administered in research studies and not used
to evaluate people or make decisions about them in any way.
Self-description inventories may prove useful in the search for more valid
personality measures. For example, it may be found that a physiological
attribute correlates with the results of a self-description inventory, say to
measure “dominance,” when the instruments are administered in a
research setting. Whereas the self-description inventory would likely not
be a valid measure in nonresearch settings, like personnel-selection
programs, because of the need and ability of individuals to fake responses,
the physiological variable might not be controllable by subjects.

In spite of the cleverness and good sense involved in some of the
projective instruments, the administration, scoring, and particularly the
interpretation are relatively unstandardized. They may provide valid
results in particular settings, used by particular examiners, and in regard
to particular traits; however, it is not known when and where they work
best and which examiners are more proficient at interpreting the results
of projective instruments. Consequently, projective techniques should not be
used for general-purpose testing until they are more fully standardized
and their validity tested for particular uses.

Unfortunately, there seems to be only a moderate amount of research
interest in the standardization of projective techniques. One example is
the multiple choice Rorschach which was mentioned in Chapter 15. Some
people feel that multiple choice forms and other standardized testing
procedures measure only the simpler human attributes and therefore lose
much of the “richness” inherent in the free-response projective testing
procedure. This criticism is probably justified in that it is much easier to
invent objective tests for simple rather than complex attributes. But part
of the criticism is probably due to a resistance on the part of some clinical
psychologists to objective tests as such. The clinician prizes his intuitive
ability and his function as an observer and interpreter of human behavior.
If these functions are taken over in part by standardized tests, the clinician
quite naturally feels that his particular usefulness is somewhat diminished.
However, the more important problem of using measurement techniques
to help unhappy people should receive precedence over professional likes
and dislikes. Consequently, every effort should be made to refine and
standardize the projective techniques.


OBSERVATION OF BEHAVIOR 

Rather than infer personality characteristics from paper-and-pencil or
projective tests, another approach is to observe people as they actually
behave. As a simple example, a disturbed child can be observed as he
plays with a group of children. If the observations can be reliably
recorded and scored, they can be used as personality measures. The
advantage of observational testing is that it has a real-life quality not
shared by conventional testing instruments.

Many everyday decisions about people are, of course, reached by a
form of observational analysis. The football coach observes the freshman
quarterback to see how well he passes, kicks, and runs with the ball. The
new bank teller is judged in terms of his promptness, accuracy in
maintaining accounts, and courteousness to customers. The new cook is judged
by her cakes and pies and by the neatness of the kitchen. The following
sections will discuss some of the ways in which behavioral observation
can be used to measure personality traits.

Ratings. Rating scales are used very widely as a means to record
behavioral observations. Many industrial settings employ a standard rating form
which foremen use to record their observations of individual employees.
Office managers use rating scales to judge the promptness, neatness, and
job proficiency of clerical workers. Candidates for graduate school
training are rated by their professors in terms of general educational
background, intelligence, initiative, emotional stability, writing skill, and other
traits. Rating scales are used widely in military life to judge the suitability
of men for promotions or for specialized training. A typical set of rating
scales is shown in Figure 16-1.

Figure 16-1. A typical set of rating scales.

he is able to make valid ratings about them. The majority of the people,
however, do nothing either so meritorious or so wrong as to make their
presence known, and therefore, only very unreliable ratings can be made
about them.

In some situations the rater has considerable experience with subjects
in a particular setting but little or no acquaintance with them in other
settings. For example, a college professor may have a good opportunity
to observe diligence in the classroom but no opportunity to judge social
behavior outside the classroom. The first step in arriving at valid ratings
is to make sure that raters have an extensive acquaintance with subjects
and that they have witnessed behavior in situations relevant to the
traits being rated.

In addition to lack of information about subjects, several kinds
of bias often enter rating scales. One of these is bias toward
particular individuals. It is difficult not to give higher ratings to
friends than to people whom we dislike. Another
source of error is a constant bias on the part of the individual rater to
rate all persons generally high or generally low. If a number of raters are
asked to rate 30 men, they will differ to some extent not only in their
orderings of the men but also in their mean ratings. Some raters have a
positive bias, tending to rate all persons near average or above. Other
raters have a negative bias, giving a preponderance of low ratings.

A form of bias called the “halo effect” consists of giving all bad, all
average, or all good ratings to people. Rather than thinking differentially
about the strong and weak points in persons, we tend to develop general
attitudes toward individuals. Consequently, the individual rater is prone
to give similar ratings on different rating scales even
when the traits are unrelated. For example, the soldier who is
good in drill and [...] is not necessarily good in
marksmanship.

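The constant bias of individual raters described above can be examined by comparing each rater's mean rating with the mean over all raters. The following Python sketch is only an illustration: the function name `rater_mean_bias`, the rater labels, and the ratings are hypothetical, and a simple difference of means is assumed as the index of leniency or severity.

```python
def rater_mean_bias(ratings):
    """Return each rater's mean rating minus the mean of all raters' means.

    `ratings` maps a rater label to that rater's ratings of the same
    group of persons (hypothetical data for illustration only).
    A positive value suggests a lenient rater, a negative value a severe one.
    """
    if not ratings:
        raise ValueError("at least one rater is required")
    means = {}
    for rater, values in ratings.items():
        if not values:
            raise ValueError(f"rater {rater!r} has no ratings")
        means[rater] = sum(values) / len(values)
    grand_mean = sum(means.values()) / len(means)
    return {rater: mean - grand_mean for rater, mean in means.items()}


# Three hypothetical raters judging the same five men on a 1-to-7 scale.
example = {
    "rater_1": [6, 5, 7, 6, 6],   # tends to rate high
    "rater_2": [4, 3, 5, 4, 4],   # near the middle
    "rater_3": [2, 1, 3, 2, 2],   # tends to rate low
}
bias = rater_mean_bias(example)
```

In this made-up example the three raters order the men identically and differ only in level, which is the kind of constant bias the text distinguishes from disagreement about the ordering of persons.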


An important point to note about ratings is that they are often
unreliable. The reliability of ratings can be studied by all of the methods
discussed in Chapter 6 in the same way that the reliability of mental tests is
determined. The preferred procedure is to have the same individuals
rated by two raters either on the same or an equivalent rating form. The
two sets of ratings can then be correlated to obtain an estimate of the
reliability coefficient. The coefficients obtained for ratings in this way are
seldom as high as .80, often run below .50, and go all the way down to
zero in some cases.
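
As a rough illustration of correlating two sets of ratings, the Python sketch below computes a product-moment correlation between two raters who rated the same individuals. The function name `pearson_r` and the ratings are hypothetical, and the product-moment coefficient is assumed here as the index of agreement.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 ratings")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a rater who gives everyone the same rating has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of the same eight persons by two raters.
rater_a = [4, 2, 5, 3, 1, 4, 2, 5]
rater_b = [3, 2, 4, 4, 2, 5, 1, 4]
reliability_estimate = pearson_r(rater_a, rater_b)
```

The resulting coefficient serves as the estimate of the reliability of a single rater's ratings in the sense used above.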

A number of things can be done to improve ratings. One, which was
mentioned previously, is to provide more and better opportunities to
observe individuals. Secondly, raters should be trained for their jobs, told
about the various forms of bias in ratings, and given extensive practice in
observing and rating individuals. Thirdly, a substantially more accurate
picture of the subject can usually be obtained by averaging the ratings of
two or more raters. This serves to iron out the biases of individual raters
and results in a more reliable measure. Using multiple raters improves the
reliability in much the same way that increasing the number of items
improves the reliability of tests. Nunnally¹ recently determined the
reliability with which four psychiatrists rated the symptoms of 21 cases. The
median reliability for individual psychiatrists was less than .40. When
ratings were averaged over the four psychiatrists, the median reliability
rose above SO. 
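
The parallel drawn above between adding raters and adding test items can be illustrated with the Spearman-Brown formula from test theory, r_k = k r / (1 + (k - 1) r). The sketch below is an assumption made for illustration, not necessarily the computation used in the study cited; the function name `spearman_brown` is hypothetical, and the formula gives only an idealized expectation with which reported figures need not agree.

```python
def spearman_brown(r_single, k):
    """Expected reliability of the average of k raters.

    r_single: reliability of one rater's ratings (between 0 and 1).
    k: number of raters whose ratings are averaged.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0 <= r_single <= 1:
        raise ValueError("r_single must lie between 0 and 1")
    return k * r_single / (1 + (k - 1) * r_single)


# Idealized expectation when a single rater's reliability is .40.
for raters in (1, 2, 3, 4, 8):
    print(raters, round(spearman_brown(0.40, raters), 2))
```

Under this formula the gain from each added rater shrinks as raters accumulate, which is consistent with the remark that little is gained beyond a few raters.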

Some personal skills, like military bearing, “bedside manner” of
physicians, and leadership qualities, are probably so complex that they can
best be measured by the equally complex judgmental processes of human
observers. However, we often make decisions about people on the rather
unreliable, often biased, judgment of only one rater. One rater is seldom
sufficient. Two or three raters provide a more reliable measurement. The
gain in precision from using more than four raters is seldom worth the
effort.


by administrative officials

helpers. Actually, the two “helpers” were psychologists whose purpose
was not to help but to hinder in all possible ways. One “helper” acted lazy
and confused; the other bickered, argued, and criticized. The “helpers”
were so effective that no one ever completed the task. The real purpose
was, of course, to observe the candidate’s behavior in a tense, frustrating
situation.

In another OSS situational test, the candidate was told to imagine that
he is caught going through files marked “secret” in a government office,
that he does not work in the particular building, and that he carries no
identification papers. The candidate was given twelve minutes to
construct an alibi for his presence in the suspicious circumstances. Then he
was subjected to a harrowing interrogation, in which attempts were made
to break his alibi and to make his statements appear foolish. The
candidate was rated on the convincingness of his story and his ability to support
it in the interrogation.

Because the OSS measurement program operated under considerable
time pressure and because it had to fulfill the immediate task of
evaluating men, some of the procedures were relatively crude. There was little
opportunity to measure the effectiveness of situational tests as predictors
of later performance. One of the difficulties was finding adequate criteria
of performance against which to validate the tests. The major criteria
were ratings by field commanders and by fellow workers. These ratings
proved to be of low reliability, and consequently, correlations with
situational test scores were quite restricted. The effectiveness of situational
tests in the OSS program is still a matter of controversy.

The OSS selection program received wide attention, and since the
Second World War, a number of attempts have been made to employ
situational tests. One of the best known of these was a five-year project whose
purpose was to develop instruments for selecting student trainees in
clinical psychology (9). Potential trainees in Veterans’ Administration
hospitals came from 30 universities scattered across the country. They
were housed at the University of Michigan during the testing period. In
addition to tests of ability and personality, a number of situational tests
were used.

The situational tests used in the Michigan study were modeled directly
after those employed in the OSS program. One consisted of a discussion
situation in which the candidates were told to act as a citizens’ committee
and discuss the use of a large financial bequest. A second situational test
involved improvising responses in a social situation. A third consisted of
a group problem-solving task involving the moving of a set of heavy
blocks. The fourth test required the individual candidate to demonstrate
the emotions expressed in poetry and in particular words, with gestures
and facial expressions only.

Whereas there was little opportunity to study the validity of situational
tests used by the OSS, an extensive study of validity was made in the
Michigan project. Criterion ratings were obtained from university and
clinical supervisors some three years after the original testing. Trainees
were rated in terms of clinical competence and preference for hiring. The
ratings made earlier in regard to performance in situational tests
correlated [...] with clinical competence and .20 with preference for hiring.
However, standard printed tests worked as well. Situational test results
added nothing to the predictive efficiency which was obtained from
objective test results and personal history information.

There are numerous practical drawbacks to the use of situational tests.
They often require elaborate equipment, much work space, and many
persons to administer the tests. It is often necessary to use trained actors
in the tests and it is difficult to standardize their performance. Much
pretesting must be done on the test design, and raters must practice their
jobs. Only one or only a few men can be tested at once, and it is time
consuming to put large numbers of men through the tests. It is difficult
to keep the tests secret, because men previously tested often pass on
information to people who have yet to take the tests. If the situations are
not changed often, with the result of much additional work, or if elaborate
precautions are not made to keep secret the nature of the situations to be
used, new subjects will “rehearse” their performances prior to the tests.

Situational tests are difficult to standardize. Only if they produce
information beyond that given by conventional measurement methods are
they worth the effort. The evidence so far indicates that ratings based on
longer-term general acquaintance with persons compare favorably with
ratings made about behavior in contrived situations.

Behavioral Tests. Another method of behavioral observation is to
obtain some objective product of activity in lifelike situations. The
behavioral test concerns what an individual actually does rather than what
he or others say he does. One of the earliest and still the best-known use of
behavioral tests was that of Hartshorne and May (8) of the Character
Education Inquiry. They wanted to measure traits in school children, such as
honesty, [...], and self-control. Rather than use printed
tests or ratings to measure these characteristics, they observed
the actual behavior of children in respect to the traits. A measure of cheating
was obtained by allowing students to grade their own papers and noting
the number of alterations of answers. Tests like vocabulary and arithmetic
reasoning were administered in the classroom. The tests were collected
and a duplicate copy made of each. The original unscored papers were
returned to students along with a list of the correct answers. Children
scored their own papers and either gave themselves the correct grade or
altered answers to improve their standings. Scores reported by students
were then compared with the scores students actually attained, as
measured by the duplicate test copies. The amount of discrepancy between
the two scores provided a measure of cheating.
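
The discrepancy measure just described amounts to a subtraction of the attained score from the reported score for each child. The short Python sketch below shows the arithmetic; the function name `cheating_scores`, the child labels, and all scores are hypothetical and are not data from the Hartshorne and May studies.

```python
def cheating_scores(reported, actual):
    """Return reported-minus-actual score for each child.

    reported: scores the children gave themselves after self-grading.
    actual: scores obtained from the duplicate copies of the papers.
    A value of 0 means no discrepancy; larger values mean more inflation.
    """
    if set(reported) != set(actual):
        raise ValueError("both records must cover the same children")
    return {child: reported[child] - actual[child] for child in reported}


# Hypothetical scores for three children on one classroom test.
reported = {"child_a": 18, "child_b": 15, "child_c": 20}
actual = {"child_a": 18, "child_b": 12, "child_c": 16}
scores = cheating_scores(reported, actual)
```

Because the index is taken from behavioral products rather than from a rater's impression, no judgment by an observer enters the score.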

Another of the Hartshorne and May tests concerned the trait of charity.
Each child was given an attractive kit, including 10 articles such as pencil
sharpener, eraser, and ruler. After the children had examined the
materials for some time, they were allowed to give away some or all of the
items to “less fortunate children.” The children were not coerced to
donate, and the donations were made anonymously. Each child was
provided with a large envelope in which to put his donation, and the
donations were dropped into a common box. Unknown to the children,
envelopes had been marked in such a way as to identify the donations
of each child. The number of articles donated served as a measure of
charity.


When behavioral tests of the kind originated by Hartshorne and May
can be used, they have a number of attractive advantages. The use of
actual behavioral products frees the measurement procedure from the
subjectivity of rating scales. If observations can be made in natural
situations where the subject is unaware that he is being tested in any sense,
the results are probably more valid. One of the faults of situational tests
is that they are more like games than real-life situations, and subjects
often regard them as such.


Other than the Hartshorne and May studies, there have been few
attempts to develop systematic behavioral tests, although much the same
thing is done informally in many evaluational efforts. The major uses of
behavioral tests have been with children. This is because children are
easier to find in group situations or easier to place in group situations
without having them suspect an ulterior purpose. Also, the behavioral
products of children are usually simpler and more easily measured than
complex adult interactions. Behavioral tests are sometimes used in
kindergarten and nursery school groups. Behavioral records can be kept
either surreptitiously by an examiner, or the children can be watched
through a one-way vision screen. Notes are made of the number of times
a child offers a toy to others, the number of times he asks for adult help,
and other relevant behavior. In order to get at more complex responses,
it is sometimes necessary to use both direct recording of actions and
ratings of behavior to measure such traits as “responsiveness” and “tendency
to withdraw.”

An adaptation of the behavioral type of test is presently being used to
measure the social responsiveness of mental patients. The patient is
placed in a standard interview situation and records are made of actual
behavior. One such routine is referred to as the Minimal Social Behavior
Scale (3). A typical “item” consists of offering the patient a pencil and
noting whether or not it is accepted. In another “item” the examiner
places a cigarette in his mouth and fumbles for a match. He stands up
and looks through all of his pockets. A book of matches sits in plain view
of the patient. The patient is scored “plus” if he mentions the matches on
the desk or offers a match from his own pocket. A series of such behavioral
observations provides a picture of the patient's social responsiveness.


MAXIMUM PERFORMANCE TESTS 
It has always been hoped that one day we should have tests, in the
stricter sense of the term, to measure personality characteristics instead
of having to use instruments like self-description inventories, which are
controllable by subjects. When an instrument concerns how much an
individual knows, how many problems he can solve, or how well he
performs, we tend to think of it as a measure of some intellectual
characteristic as distinct from personality characteristics. However, it is
possible that personality characteristics can be measured by what appear
to be “ability,” “cognitive,” or “maximum performance” type tests.

[...] vision.” Dark vision is, quite simply, the ability to see well at
night, in contrast to the usual criterion of good vision, the ability to
see well in daytime or in lighted rooms. It has been known for some time
that dark vision is not necessarily related to day vision. Persons who can
see well in the daytime often have great difficulty in seeing well at
night, and vice versa. Eysenck compared the dark-vision ability of [...]
patients with 6,000 apparently non-neurotic members of the British [...].
On a scale ranging from 0 to 32, the neurotics had a mean score of [...]
and the normals had a mean score of 19.3. Additional investigations
showed that the dark-vision test tends to discriminate among neurotics
in terms of the severity of their disorders.

Another perceptual measure which is currently receiving some attention
as a possible test of personality characteristics is the color-form
dominance test. The test consists of presenting on film visual images of a
composite nature so that the subject can report a perception of movement
which is determined either by colors or by form, or shape, only. It is
found that some people characteristically see the color-determined
percept and others characteristically see the form-determined percept. The
evidence so far is insufficient to say whether or not color-form dominance
relates to particular personality characteristics.

Persistence Tests. Tests of the maximum performance type have been
used to measure how long individuals will work at difficult or even
impossible problems. These supposedly measure a trait of persistence,
which in everyday language means the tendency to stick with problems
when the going is tough. In a typical test, the subjects are presented with
a very difficult mechanical puzzle. Some subjects will give up after a few
failures; other subjects will work on in spite of numerous failures. Similar
tests can be constructed for perceptual, verbal, motor, and numerous
other ability functions. Studies to date (see 2, pp. 284-290) indicate that
there is a general trait of persistence underlying many of the tests. The
trait correlates to some extent with intelligence test scores but correlates
more highly with school grades; this indicates that an important
personality attribute is at least being touched on in persistence tests.

Response Sets. It is possible to differentiate people not only in terms
of the test scores which they make but in terms of the way in which they
take tests as well. Regardless of what the test is about, people bring with
them certain test-taking habits, or “response sets,” which, in part,
determine their scores. One of the oldest observations along these lines is
that personality, or as it was called in older days, “temperament,” is
involved in psychophysical studies of judgment. The subject can be asked
either to judge whether stimulus A is greater or less than B or he can be
provided with a third choice, an “indifferent” category, in which the
judgment is “neither less nor greater.” It was noted early that some
subjects use the “indifferent category” more often than other subjects do,
regardless of the judgments involved, suggesting the presence of a
personality trait seemingly involving willingness to take a stand.

One type of response set which has received some attention lately is
that of acquiescence. Acquiescence is usually studied with either very
ambiguous statements or very difficult statements like the following:


The moons of Saturn were first dis[...]

[...] excess of porthymadrone results in dilation of the pupils.

[...] neuroticism and other personality [...]

[...]man (6). The test was constructed for the purpose of predicting the
success of naval officer candidates in training school. A battery of
conventional tests of the aptitude type was being used at the time, and it
seemed that only by the addition of tests of the [...] type could
substantial gains in validity be made. Although the [...] Knowledge Test
is ostensibly a measure of factual information about naval customs,
equipment, and history, it apparently measures components of personality
and interests. The sort of person who would interest himself in facts about
the Navy and naval history before entering the service is likely to be
someone who will like naval life and do at least moderately well on actual
duty. The test added substantially to the predictive efficiency which was
obtained from the aptitude battery alone.

Suggestibility Tests. It has long been thought that there is a connection
between suggestibility and neurosis. One of the most widely used measures
of suggestibility is the “body sway test.” In the test, the subject
stands blindfolded, and the experimenter says, “You are falling forward,
you are falling forward . . .” Under these conditions some people will
sway very little or not at all and others will sway to the extent of actually
falling. The amount of sway is measured by mechanical equipment. Using
this type of testing procedure, Eysenck (2, pp. 296-299) found a definite
relationship between suggestibility and neuroticism. Not only was there a
significant mean difference in suggestibility between neurotics and
normals, but amount of suggestibility differentiated among neurotics in
terms of the severity of their disorders.


PHYSICAL AND PHYSIOLOGICAL MEASURES 


It is a truism to say that the way a person is physically constituted has
a strong influence on his personality. Although much of what we call
personality is the result of long and complex learning experiences, much
also is apparently determined by the particular muscular, neural, and
chemical make-up of the individual. In so far as relevant physical and
physiological traits can be measured and validated, they can be used
as tests.

There is a wealth of evidence to show a relationship between
physical states and personality. When people are sick, they do not
behave the same as when they were well. They are often more easily
disturbed, quick-tempered, and depressed. Ordinary medicines which are
taken for “colds” and other disorders affect our moods and social behavior.
Different drugs tend to have different effects. One drug tends to make
people elated and another tends to depress. The impact of neural
functions on personality is often witnessed in the altered behavior of people
with brain damage. Even the relatively mild physiological impact of the
changing weather seems to alter our moods and social behavior.

Physique. It has long been thought that different types of body build
accompany different types of personalities. The prevalent social
stereotype is that short, fat people are friendly and jolly, whereas tall, thin
people are thought to be shy and “serious.” There have been numerous
investigations during the last half-century (see 2, chap. 5) both to
catalogue body types and to relate these to personality characteristics. The
typology which is used most often is that of the endomorph, or short, fat
build, the ectomorph, or tall, thin build, and the mesomorph, or athletic
build. Very few persons fall neatly into one of the types, and
consequently, each person must be considered some compound of the three.
Whereas there are evidently some relationships between body type and
personality characteristics, the relationships are not as strong as was
originally supposed. One of the major findings is that, among the
mentally ill, schizophrenics tend to be ectomorphs and manic depressives
tend to be endomorphs (2, chap. 5). There is also some support for the
oft-held contention that ectomorphs are more often introverted and
endomorphs more often extraverted (2, chap. 5). Perhaps the next decade
will show more clear-cut relationships between physique and
personality characteristics.

Autonomic Functioning. The autonomic nervous system regulates
states of bodily alertness, reaction to fear, accommodation to sleep,
and many others. In spite of the many suggestions that people differ in
autonomic characteristics and that these in turn are related to [...]
The evidence is yet too meager to say with certainty how personality
characteristics relate to autonomic activity. The following suggestive
results are offered mainly to indicate what is being attempted.

In terms of the first type of study mentioned above, there is some
evidence to indicate that mobilized autonomic states (high blood pressure,
salivation, muscular tonus, etc.) are related to neuroticism and
personality inventory measures of shyness, introversion, and emotional
instability. In an interesting study of experimental changes in autonomic
activity, Freeman and Katzoff (5) measured the effect of sudden loud noises,
electrical shock, and others on autonomic mobilization and recovery.
They found rate of recovery to be significantly related to psychiatric
ratings of emotional stability.

Blood Chemistry. The blood stream carries not only the nutrients and
regulators of bodily activity but substances which influence our moods
and social behavior. In so far as chemical differences among people can
be shown to relate to personality differences, a “blood test” might prove
to be a perfectly sensible approach to predicting certain kinds of social
behavior. Most of our knowledge about the relationship between social
behavior and blood chemistry comes from introducing chemicals into the
blood and noting their effect. Experimental studies of the effect of drugs
and glandular injections on social behavior exemplify this approach.
However, it is more interesting, from the standpoint of psychological
measurement, to search for existing chemical differences among people and to
relate these to social behavior.

The use of chemical indices to measure personality characteristics has
received great encouragement from recent findings about schizophrenia.
A number of independent investigators (see 11) have isolated a copperish
substance which is present to a marked degree in the blood of most
schizophrenic patients but present to only a slight extent in most normal
individuals. Other studies show that when a substance from the blood of
schizophrenic patients is injected into normal persons, a transitory
state of schizophrenia develops which is indistinguishable from the
behavior of real patients. These findings may lead not only to new
methods of treating schizophrenia but to a method for detecting
schizophrenia as well.

Summary of Physical and Physiological Measures. It is doubtful that
the fine points of personality will be measured by strictly “physical”
variables alone. The evidence so far is that physique, autonomic activity, and
blood chemistry may eventually define some broad personality tendencies
in people. As the present findings regarding schizophrenia indicate,
physical measures may prove useful in predicting whether or not people will
develop severe mental disorders. However, it is doubtful that physical
measures alone will be the most efficient way to make more complex
predictions, such as whether two people will make good marriage partners
or whether an individual will succeed as a physician.


BIOGRAPHICAL AND PERSONAL DATA INFORMATION


[...] in everyday life. Without any other
way of forecasting how an individual will behave in the future, the
best bet is that he will continue behaving in the same manner
as in the past. This is the principle upon which so much of our
knowledge of other people is based. By observing our family and friends
in many situations, we gradually learn to predict how they will behave
in new situations. The person who has behaved shyly in numerous social
situations in the past is not likely to switch to being the “life of the party.”
The individual who has a long history of fights and arguments is likely to
continue in that vein. The person who has shown jealous behavior in
numerous previous situations is likely to act jealous in new situations.
People do change, of course, but seldom so markedly or so rapidly as to
render them grossly unpredictable to friends and family.

Although we seldom keep systematic records of the validity with which
we can predict the behavior of others, at least we feel that our forecasts
are fairly accurate. There are, of course, many exceptions, which are
evidenced in the oft-heard phrase, “I didn't think he would do that.” It would
be an odd world, and a frightening one too, if we were not able to predict
the behavior of others with some accuracy. We can depend partly on
cultural mores and customs, which force most of us into relatively narrow
channels of social response. For example, if you say “Good morning” to
an acquaintance, it can be predicted with fair accuracy that he will say
“Good morning” or some variant of it in return. If he replied with “Fish
for sale” or some other highly unpredictable response, you would
probably attribute it to mental illness. Beyond the confines of culturally defined
actions, we also learn to predict with some accuracy the idiosyncrasies of
particular persons. We do this mainly through numerous observations of
behavior in different situations.
Rather than accumulate information about people on an informal basis,
a systematic accumulation of facts can be made. Either the person can be
investigated, or the individual himself can be asked to supply pertinent
information. This is essentially what is done in many job application
blanks. The individual gives information about places of residence, health,
work history, military experience, hobbies, personal accomplishments, and
others. Although the application blank is usually not employed as a test,
it can be standardized and scored.

Although it might prove useful in the future to collect biographical
information on general personality traits, most of the uses so far
have been in selecting personnel for particular jobs. The “test” is
constructed much like any other predictor instrument. A large collection of
“items” is made. Each item concerns a personal history event, action, or
accomplishment. The items would, of course, likely be different for
different personnel-selection situations. Some items which might be useful to
predict the success of real estate salesmen are as follows:


1. How many jobs have you had during the last five years?
2. How long were you employed in your last job?
3. How many years of schooling have you completed?
4. Do you own your own home?
5. How long have you lived in this city?
6. Are you married?
7. Do you belong to a fraternal organization or club?
8. How much money do you have in the bank?


Here we are only guessing what might be important characteristics of a
successful real estate salesman. The real “proof of the pudding” is to learn
which items differentiate successful salesmen from unsuccessful salesmen.
A longer list of biographical items like those above could be administered
to job applicants. One or several years later, measures of job success could
be obtained, which might include ratings, amounts of property sold, and
other pertinent indices of success. Then item-analysis procedures like
those discussed in Chapter 8 could be performed on the biographical
items to obtain an efficient predictor instrument. This might show that
successful salesmen are usually married, own their own homes, and have
been residents of the particular city for a long time. The most valid items
can then be used to form a test, which can henceforth be used to select
personnel. The finished product is called a biographical inventory,
personal data form, or personal history record.
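
The selection of valid items just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are invented, each item is coded 1 for “yes” and 0 for “no,” and the validity index used here (the difference between the proportions of successful and unsuccessful salesmen answering “yes”) is one simple choice, not necessarily the procedure of Chapter 8. The function names and the 0.3 cutoff are hypothetical.

```python
from statistics import mean


def item_validities(responses, success):
    """For each item, return the proportion of successful persons answering
    "yes" (coded 1) minus the same proportion among unsuccessful persons."""
    successful = [r for r, s in zip(responses, success) if s]
    unsuccessful = [r for r, s in zip(responses, success) if not s]
    n_items = len(responses[0])
    return [
        mean(r[i] for r in successful) - mean(r[i] for r in unsuccessful)
        for i in range(n_items)
    ]


def select_items(validities, threshold=0.3):
    """Keep the indices of items whose index is at least `threshold`
    in absolute value (the cutoff is an arbitrary, hypothetical choice)."""
    return [i for i, v in enumerate(validities) if abs(v) >= threshold]


# Invented data: rows are applicants; columns are the yes/no items
# "married", "owns home", "long-time resident", "belongs to a club".
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
# Job success measured one or several years later (invented).
success = [True, True, True, False, False, False]

validities = item_validities(responses, success)
selected = select_items(validities)
print(validities)
print(selected)
```

With these invented figures the club-membership item does not separate the two groups at all and would be dropped, while the other three items would be retained for the finished inventory.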

Biographical inventories have been employed successfully in a number
of selection programs. One of the most striking examples is that of the
Aptitude Index (see 4, pp. 282-286) developed by the Life Insurance
Agency Management Association to select insurance salesmen. Part I of
the Index consists of 10 biographical items concerning the net worth of
the applicant, age, social affiliations, and others. Weighted scores
on the 10 items are averaged and transformed into letter grades ranging
from A through E. Validation studies (see 4, pp. 282-286) show that
applicants who score A on the test sell between 300 and 400 per cent more
insurance during the first working year than applicants who score E.
Biographical inventories were among the most valid predictors of the
successfulness of pilot trainees in the Army Air Force during the Second
World War (see 7, chap. 27). The inventories concerned background
information on parents, social affiliations, employment, hobbies, athletic
accomplishments, and schooling. One form correlated .42 with the success
of pilots in training.

You might wonder why a self-report test like the biographical
inventory works well as a selection instrument whereas self-report instruments
of similar appearance, such as self-description inventories, generally do
not work well. Biographical inventories have a number of advantages over
self-description inventories. It is difficult to construct unambiguous items
for self-description inventories, but it is relatively easy to construct
meaningful biographical items. The subject is asked his age, how long he has
been married, and other understandable questions. It is apparently easier
to hypothesize and compose valid biographical items than valid
self-description items. The research effort needed in the former is therefore
considerably reduced. Most important, whereas self-description
inventories are quite often invalidated by the amount of faking in selection
situations, biographical inventories are affected little by faking. The reason
for this is that most of the biographical items are factual matters of the
kind that can be checked on by others. An applicant who would be
willing to say that he loves his father more than his mother in order to
appear in the best light would be quite reluctant to say that he is
married when he is not or that he belongs to the Elks Club when he does
not. There is some stretching of facts, such as the addition or subtraction
of years from one's age, on biographical inventories, but this is seldom
so gross as the kind of distortion encountered in self-description
inventories.

[...] presently available for selection programs. Whatever
kinds of personality attributes they measure, they
add substantially to the predictive efficiency which can be obtained from
tests of intellectual functions and special abilities.


SUMMARY 

It is easy now to see that the early workers in [...]
anticipated achieving their goals more quickly and more easily than has
actually been the case. This was probably due in large part to the
comparative ease with which the early tests of human ability were
constructed and the many successful applications which have been made
of them. [...] this chapter show that there are some hopeful approaches to
personality [...]; these are mainly complex procedures which
require a relatively large amount of work both in their construction and
their validation. The need for adequate personality measures is so
acute that even expensive [...] techniques would be worth [...]


CHAPTER 17


Selected Techniques for Psychological Experiments 


There are many psychological studies in which individual differences
among people are of only secondary interest. In Chapter 1, a distinction
was made between studies of individual differences and studies of
psychological processes, a process being defined as a set of systematic changes
within the individual. Studies of psychological processes are usually
referred to as experiments, in distinction to the more general term research,
which is used to refer to systematic observations of all kinds, including
studies of individual differences.

The purpose of this chapter is to consider some of the measurement
techniques that are currently receiving wide use in psychological
experiments. In addition to their use in experiments, some of the techniques to
be discussed are useful in “descriptive” studies, where the effort is to
determine the mean responses of a group of individuals. The opinion poll
is an example of a descriptive study.

Little attention will be given in this chapter to measurement methods
for studies of physiological psychology, perception, and learning. These
are treated adequately in available texts. The emphasis will be on
measurement in social psychology and the study of personality.

In its simplest form, an experiment involves two variables or measures.
One variable is manipulated by the experimenter, or the experimenter
waits for the variable to change by natural causes. The variable which
changes in this way is called the independent variable. The experimenter
then determines the influence of the independent variable on a second
measure, the dependent variable.

The simplest form of experimentation can be illustrated with the
well-known laws of gases. The problem is to determine the influence of
pressure on the temperature of a gaseous body. A gas, say ordinary air, is
enclosed in a cylinder. A gauge on the cylinder tells the pressure of the
gas. A thermometer notes the temperature inside the cylinder, and
temperature is the dependent variable. Pressure is the independent variable.
A plunger in the cylinder is slowly forced down, raising the pressure of
the gas inside. It is found in this way that temperature is a linear function
of pressure. An experiment in which change in the independent variable
has no influence on the dependent variable can be found in Galileo's study
of falling bodies. If, as story has it, Galileo dropped a small and a large
stone from the Leaning Tower of Pisa, he demonstrated that weight
has no influence on the velocity of falling bodies.
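
As a rough sketch of what “a linear function” means in practice, the following Python fragment fits a straight line to a handful of paired readings by least squares. The readings are invented for illustration and are not taken from any actual experiment; the function name `fit_line` is likewise hypothetical. Pressure plays the part of the independent variable and temperature the part of the dependent variable.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented gauge readings (independent variable) and thermometer
# readings (dependent variable); the units are arbitrary.
pressures = [1.0, 1.5, 2.0, 2.5, 3.0]
temperatures = [20.0, 31.0, 40.5, 51.0, 60.0]

slope, intercept = fit_line(pressures, temperatures)
print(slope, intercept)
```

A slope near zero would correspond to the second kind of result described above, in which changes in the independent variable have no influence on the dependent variable.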


The tests and measurement techniques which have been discussed in
previous chapters can be and are used as variables in psychological
experiments. Some examples are as follows:
1. Psychophysical judgment. The purpose of the experiment is to
determine the influence of alcohol on sensory discrimination. Subjects are given
an ounce of alcohol each fifteen minutes over a period of an hour and
a half. Five minutes after each “cocktail,” the threshold for faint
signals is tested by the method of limits. Amount of alcohol consumed is
the independent variable and height of threshold is the dependent
variable.

2. Achievement testing. The purpose of the experiment is to determine
the differential amount of learning from two methods of teaching
statistics. Six instructors each teach two sections of statistics. One section is
given three lectures per week. In addition to the three lectures, the other
section is given a laboratory session in which statistical problems are
discussed and worked through. In this experiment, the independent variable
is expressed as a qualitative distinction between two methods of
instruction. The dependent variable consists of a standard achievement test
which is given to all students. An analysis is undertaken to determine
which method of instruction produces the higher scores and to find the
kinds of items which are more often correctly answered by the students
in one or the other circumstance.

3. Intelligence testing. The purpose of the experiment is to determine
the influence of distraction on intelligence test scores. A group of
college students is randomly divided into three subgroups of 100 each. One
of the group intelligence tests is given to each subgroup in a separate
room. Group A is tested in a quiet room as free as possible from
distractions. Group B must work while a phonograph record plays music.
Group C is bombarded with noise. The same record used
with Group B is played along with the noise of sawing wood from
a buzz saw in the back of the room and the heavy beating of a bass
drum from the front of the room. Amount of distraction is the
independent variable and intelligence test score is the dependent variable.

4. Attitude measurement. The purpose of the experiment is to
determine the influence of living in a Northern city on the attitudes of Southerners
toward Negroes. The dependent variable, a measure of attitude toward
Negroes, is given to 1,000 migrant Southerners who have lived in the
North from one month to thirty years. The independent variable is the
number of months’ residence in the North. The independent variable is
correlated with the dependent variable.

5. Projective testing. The purpose of the experiment is to study the
personality changes that go on in two different kinds of psychotherapy.
In one type of therapy, the patients are seen individually [...]. In the
other, patients meet as a group and talk over [...].
Patients are given Rorschach tests before and after therapy. [...]
examiners rate the Rorschach records for amount of [...]
problems, severity of illness, and other [...]. Comparisons are made
to see which type of therapy produces the most change and to [...]
changes that go on in each. The independent variable is the qualitative
distinction between kinds of therapy. The dependent variable is the
amount of improvement shown by the [...].
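
The way independent and dependent variables are related in designs like examples 3 and 4 can be sketched in Python as follows. Everything here is an assumption made for illustration: the scores are invented, only a few subjects are shown rather than the hundreds described, and the helper name `pearson_r` is hypothetical. For a qualitative independent variable (the three distraction conditions) the group means are compared; for a quantitative one (months of residence) a product-moment correlation is computed.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Design like example 3: the independent variable is the testing
# condition; the dependent variable is the test score (invented).
scores_by_condition = {
    "A (quiet)": [112, 108, 115, 110],
    "B (music)": [109, 107, 111, 105],
    "C (noise)": [101, 98, 104, 97],
}
group_means = {name: mean(s) for name, s in scores_by_condition.items()}

# Design like example 4: months of residence (independent variable)
# paired with an attitude-scale score (dependent variable), invented.
months_in_north = [1, 12, 60, 120, 240, 360]
attitude_scores = [20, 22, 27, 30, 38, 41]
r = pearson_r(months_in_north, attitude_scores)

print(group_means)
print(round(r, 3))
```

Neither computation says anything about whether the dependent variable measures what it is purported to measure; that question of validity is separate from the arithmetic.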
The Validity of Experimental Variables. The validity of measures of
individual differences was discussed in Chapter 4. A distinction was
made between predictors and assessments, and rationales were presented
for validating each type of measure. Although the rules given there are
not completely airtight for all testing problems, they are sufficient guides
for the major uses of psychological tests in vocational guidance, school
management, classroom testing, and personnel selection.

How can the validity of experimental variables be determined? In other
words, how can it be determined whether or not an experimental
technique measures what it is purported to measure? The rules which will
be given here for the validation of experimental variables are less definite
than those given in Chapter 4 for the validation of individual difference
measures. As a first principle, there is usually more difficulty in validating
dependent rather than independent variables. Independent variables are
[...] such as amounts of time, kinds of training, and
other variations of circumstances. There is often considerable difficulty in
determining whether dependent variables are what they are purported to
be. The examples, 1 through 5 above, of psychological experiments were
ordered in terms of the difficulty with which their dependent variables
could be validated.

Any experimental variable which can be considered an assessment in
terms of the distinction made in Chapter 4 can be judged by the standards for
assessments. Such measures are either reasonably self-evident measures of
what they are supposed to measure, such as sensory threshold in the first
example, or measures founded on the principle of content sampling, such as
the achievement test in example 2. It is not with measures of the
assessment type that particular difficulty is met in the validation of
experimental variables. Assessments used as experimental variables can be
validated with equal ease, or difficulty, as those used in studies of
individual differences.

The difficulty in validating experimental variables comes with the use
of instruments that are tests of the predictor type. Examples 3, 4, and 5
above all use dependent variables which the author would classify as
predictors. The function of the dependent variable is not to predict
future behavior outside of the experiment but to measure the change that
goes on within the experiment. Consequently, experimental variables
should all be assessments. Unfortunately, either nothing is known or
nothing is feasible to use in many experiments except measures of the
predictor type.

Whenever the experimental variables consist of predictor measures, the
experimental results are open to interpretation and, more or less, to
uncertainty as to what they mean. In example 4 above, the experimental
results would show only that people express different attitudes after they
have been in the North for a considerable period of time. Verbal behavior
is important, and in some studies this is all that is meant to be measured,
but in many experiments the object is to show that attitude in some more
basic sense has changed. It would be hoped that the changes in response
to the attitude-measuring device reflect real changes in the response
to Negroes rather than only a change in verbalized attitudes.

The meaning of the results in the fifth example above is directly
dependent on the validity of the Rorschach test and its use by examiners as
a measure of personality attributes. Because of the present
questions about validity, most psychologists would view the results of
the experiment with some skepticism.

The validity of predictor measures in psychological experiments is
dependent on what the study purports to show. The purpose of some
experiments is to show that predictor measures are influenced by the
circumstances in which they are administered. Example 3 is meant
to show that distraction in the testing situation influences scores on
intelligence tests. No one would interpret the results of the experiment as
showing that distraction influences intelligence but only the testing of
intelligence.

Some of the experimental variables which will ws > subject says that 
lowing Sections can be used either to record poni, to Jncdsurd 
he likes or that he thinks, or they can be used in go intention is only 
what he likes and thinks, In the former usage, b es himself and others: 
to determine what the subject is willing to c. the letter ence: when an 
the results can usually be taken at face 1 tendencies in the 
experimenta] variable purports to 1 A of the variables is open 
individual, not just verbal report, the validity 
to question, 
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When instruments of the predictor type are used as experimental varia- 

bles and the intention is to measure something more than verbal report. 
the principal validation procedure is that of construct validity as it was
discussed in Chapter 4. That is, a measure of the predictor type gains
meaning of its own through continued use and research. That is why the
intelligence test in example 3 above can be considered to have more
construct validity than the attitude-measuring instrument in example 4.
And similarly, the attitude-measuring instrument in example 4 has more
construct validity than the Rorschach test in example 5. The building-up
of construct validity for an instrument is a slow process, and no instrument
in psychology has been studied sufficiently to say unequivocally that
it has a specified degree of construct validity. However, for many of the
experimental variables that are in use, validity can be known only by
the slow building-up of connections with other human attributes.

It would be unduly restrictive to expect that researchers who are working
on the “cutting edge” of psychological science should use only experimental
variables of known validity. In the beginning stages, a new measure
is more a hunch than a proved instrument. But it is also incumbent
upon psychological investigators to keep their measures open to continuous
research and not simply to retain measures because of faith, fiat, or
fashion.


THE Q-SORT 


The Q-sort rating technique is most prominently identified with the
work of William Stephenson (see 23) and his colleagues. The particular
rating scheme is only one aspect of a more general methodology for the
study of personality. Stephenson's methodology will not be discussed
here, only the handy Q-sort rating procedure, which lends itself to a
wide variety of studies.

The Q-sort was developed as a means of recording and measuring
multiple judgments, preferences, and impressions. In a typical study,
subjects are asked to rate 100 photographs of art objects in terms of their
preferences. Instead of rating each of the photographs separately, the
subject is asked to “sort” them in terms of a forced-normal distribution. A
typical forced-normal distribution for 100 items is shown in Table 17-1.


TABLE 17-1. TYPICAL FORCED-NORMAL Q-SORT RATING DISTRIBUTION
FOR 100 ITEMS

                  Prefer least                              Prefer most
Number of items     2   4   8  12  14  20  14  12   8   4   2
Pile number         0   1   2   3   4   5   6   7   8   9  10

In the forced-normal distribution shown in Table 17-1, the subject is
first required to pick the two photographs which he likes best and place
them in pile 10. Then he chooses the 4 photographs which he likes next
best of the remaining 98 photographs and places them in pile 9, and so
on to the 8 photographs in pile 8, and the 12 photographs in pile 7. The
pile numbers are written separately on file cards and spread out on a
large table or desk. The subject literally sorts out the photographs to
demonstrate his preferences.

The most widely used approach is to have the subject work from both
extremes of the distribution toward the middle. That is, he first sorts the
items into the several “most prefer” categories, say in this example working
down as far as pile 7. Then he switches to the “least prefer” end, working
from pile 0 to about 3 in the present example. Finally, the items that remain
are placed in the remaining middle piles. One reason why this approach
is used is that extreme likes and dislikes are usually easier for the subject
to rate. A second reason for working from the extremes to the middle of
the continuum is that the extremes are much more important than the
middle piles in determining the size of correlations and other statistical
measures. Therefore, it is more important to have reliable ratings on the
extremes than on the middle piles.
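
The mechanics of the forced sort can be illustrated with a short sketch in
Python. It is only an illustration under stated assumptions: the function
name assign_piles is hypothetical, and so is the idea of starting from a
numerical preference score for each item, since in an actual Q-sort the
subject does the ordering by hand. The pile sizes are those of Table 17-1.

```python
# Pile sizes from Table 17-1, listed from pile 0 ("prefer least")
# to pile 10 ("prefer most").
TABLE_17_1 = [2, 4, 8, 12, 14, 20, 14, 12, 8, 4, 2]


def assign_piles(scores, pile_sizes=TABLE_17_1):
    """Return a pile number for each item, given one preference score per item.

    Hypothetical helper: items are ranked from least to most preferred and
    the ranking is cut into piles of the forced sizes. Ties are broken by
    the order in which the items are listed.
    """
    if len(scores) != sum(pile_sizes):
        raise ValueError("number of items must equal the total of the pile sizes")
    # Item indices ordered from least preferred to most preferred.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    piles = [None] * len(scores)
    start = 0
    for pile_number, size in enumerate(pile_sizes):
        for item in order[start:start + size]:
            piles[item] = pile_number
        start += size
    return piles


# Usage with made-up scores: item i simply receives the score i.
piles = assign_piles(list(range(100)))
```

Whatever scores are supplied, every sort produced in this way has the same
number of items in each pile, which is the defining feature of the forced
distribution.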

A forced distribution can be constructed quite simply for any-sized
item sample being used. This can be done by referring to the areas in
different sections of the normal distribution (see Appendix). A simpler
procedure is to make up a distribution which looks like the normal curve
without going to a more rigorous effort to fit the normal distribution form.
It matters very little whether the distribution is precisely normal or only
approximates the normal distribution.

The number of categories, or piles, varies in different studies in
accordance with the size of the item sample. A rule of thumb is to use a
smaller number of piles with 60 to 80 items, 11 piles with 80 to 100 items,
and a larger number of piles with samples of more than 100 items. Most
persons prefer an odd rather than an even number of piles. An odd number
of piles provides a middle pile that can be used for “neutral” items or
items that are unrelated to what is being rated.
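
As a rough sketch of how pile sizes might be read off the normal curve, the
following Python function divides the curve into equal-width sections and
converts the area in each section to a number of items. The function name,
the choice of 2.5 standard deviations as the outer limit, and the rounding
rule are all assumptions made for the illustration; they are not taken from
the text, and the result need not coincide with Table 17-1.

```python
from statistics import NormalDist


def forced_normal_counts(n_items, n_piles, z_limit=2.5):
    """Approximate a forced-normal distribution over an odd number of piles.

    Assumption (not from the text): the piles divide the range from
    -z_limit to +z_limit of the standard normal curve into equal-width
    sections, and the two end piles also take the area in the tails.
    """
    if n_piles % 2 == 0:
        raise ValueError("this sketch handles an odd number of piles only")
    cdf = NormalDist().cdf
    width = 2 * z_limit / n_piles
    half = n_piles // 2
    lower = []
    previous_area = 0.0  # area to the left of the current section
    for k in range(half):
        upper_edge = -z_limit + (k + 1) * width
        area = cdf(upper_edge)
        lower.append(round(n_items * (area - previous_area)))
        previous_area = area
    # The middle pile takes whatever is left, so the total is exactly n_items.
    middle = n_items - 2 * sum(lower)
    return lower + [middle] + lower[::-1]


counts = forced_normal_counts(100, 11)
```

The upper piles are a mirror image of the lower ones, so the distribution is
symmetrical by construction. As the text notes, a distribution made up by eye
would serve about as well.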


Item Universes and Item Samples. One of the appealing features of the
Q-sort is its ability to handle a wide variety of item material. Things
that can be photographed, drawn, or written can be used as
items. This provides a means of studying many problems that
would otherwise be experimentally unmanageable.

One of the praiseworthy features of the Q-sort is that it emphasizes the
sampling of item content. The set of items actually used in the
Q-sort, 100 photographs in the illustrative study, is called an item
sample, or it is variously called a “trait sample” or a “stimulus sample.”
When properly done, the item sample should be as representative as
possible of a specified “item universe.” Content sampling was discussed
previously in respect to achievement tests and to assessments generally. A
rating method like the Q-sort also has more value when the items are drawn
broadly across a larger body, or universe, of possible items.
Analysis of Q-sort Data. The completed Q-sort represents the subject's
preferences for or judgments about the things that are rated. The
simple description of people's preferences is sometimes all that a study
is meant to accomplish, but in most cases the object is to compare the
responses of subjects with one another or to compare the responses of
subjects before and after an experimental treatment. The correlation
coefficient is the statistic which is used most often in the analysis of
Q-sort data. One sort can be correlated with any other that is made with
the same item sample. This provides a correlation between two persons rather
than the more conventional correlation between two tests. The correlation
between two sorts is an index of the similarity of the sorts made
either by two persons or by the same person on two occasions. Because of
the forced distribution that is used for all subjects with a particular item
sample, a short-cut correlational formula can be applied. The formula is
presented in Appendix 7. It allows the computation of correlation
coefficients in only a fraction of the time that is required by conventional
formulas like those discussed in Chapter 5.

Correlations among sorts are often carried on to a factor analysis. The
correlation matrix contains all of the intercorrelations among the sorts of
a number of persons; or in a few studies the correlation
matrix is filled with the intercorrelations of a number of sorts made by
only one person. Conventional factor-analytic techniques (see 10) can be
applied to the intercorrelations among persons just as they are applied to
the intercorrelations among tests. Each factor represents a common way of
ordering the items by a number of persons.

In the illustrative study mentioned earlier with the 100 pictures of art
objects, the preferences of some subjects might be determined by realism,
by the extent to which the art objects accurately represent real-life forms.
The subjects whose choices are made in this way will tend to correlate
with one another. Other subjects may sort the 100 pictures in terms of
their impressionistic value, regardless of how well the items picture
real-life forms. The sorts of the impressionists will tend to correlate more
with one another than they correlate with those of the realists. These are
the conditions under which clusters arise in the correlation matrix, and the
impressionists and realists are likely to be separated as factors in a
formal factor-analytic treatment. Most persons are likely to have
a mixture of factors rather than to have one factor only. That is, a
subject's sort can be determined both by realism and impressionism. The
loadings which the subject has on the different factors show the extent to
which the factors determine his preferences or judgments.
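
The correlation of one sort with another, and the matrix of such
correlations among persons, can be sketched in Python as follows. The
short-cut shown here is one that follows algebraically from the fact that
all sorts share the same mean and variance; whether it is the same form as
the formula in Appendix 7 is an assumption. The function names and the small
nine-item example are made up for the illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Conventional product-moment correlation between two sorts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def shortcut_r(x, y):
    """Short-cut for two sorts made with the same forced distribution.

    Both sorts then have the same mean and variance, so the correlation
    reduces to 1 - sum(d^2) / (2 * N * variance), where d is the difference
    between the pile numbers given to the same item in the two sorts.
    """
    n = len(x)
    mean = sum(x) / n
    variance = sum((a - mean) ** 2 for a in x) / n
    sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - sum_d2 / (2 * n * variance)


def correlation_matrix(sorts):
    """Intercorrelations among persons; sorts maps a name to pile numbers."""
    names = list(sorts)
    return {(p, q): shortcut_r(sorts[p], sorts[q]) for p in names for q in names}


# Made-up example: 9 items sorted into piles 0-4 with forced sizes 1, 2, 3, 2, 1.
sorts = {
    "A": [0, 1, 1, 2, 2, 2, 3, 3, 4],
    "B": [1, 0, 2, 1, 2, 3, 2, 4, 3],
    "C": [4, 3, 3, 2, 2, 2, 1, 1, 0],
}
matrix = correlation_matrix(sorts)
```

A matrix of this kind, with persons rather than tests as the variables, is
what would be carried on to a factor analysis.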
Evaluation of the Q-sort. One point of criticism which has been leveled
against the Q-sort concerns the forced-normal distribution. The rating
technique assumes without any formal proof that a normal distribution
fits the ratings of different subjects. The forced distribution requires all
subjects to have the same mean, the same standard deviation, and the
same distribution form. It is possible and quite likely that if subjects are
left to use whatever distribution they like, they will differ from one
another in the way in which they distribute the items.

It can be counterargued that the forced distribution is only a convenience
which does little injustice to the data one way or the other. If, as is
usually the case, correlations among persons and subsequent factor analysis
are the major statistical treatments, the forced distribution distorts the
data almost not at all. It was shown in Chapter 5 that the correlation
coefficient automatically equates means and standard deviations of two
sets of numbers. The correlation coefficient is also insensitive to
changes in the distribution form. If two normally distributed sets of ratings
correlate a particular amount, say .50, the correlation will usually not
change by more than .01 or so if both distributions are changed to
complete rank-order. Likewise, skewing the distributions or making
them bimodal usually brings about only small changes in particular
correlations. The forced distribution then saves a great deal of time in
performing correlations (see Appendix 7) when only adding machines and
desk calculator equipment are available.

The Q-sort is a relative rating technique. It requires the subject to
decide which things he likes better or to judge which things have more
of a stated attribute. Some persons have criticized the Q-sort on this basis,
claiming that absolute ratings should be used instead. An example of an
absolute rating is the conventional rating scale, in which each item is
rated separately: either into one of two categories or along a continuum.
Absolute ratings could be used in the illustrative study with art objects
mentioned earlier. Each photograph could be rated separately on a
seven-point scale ranging from “like very much” to “dislike very much.”
The people who criticize the relative ratings in the Q-sort evidently fail
to consider that almost all of the traditional psychophysical methods are
relative ratings also. Looking back to the discussion of psychophysical
methods in Chapter 2, it can be seen that methods such as
rank-order and pair comparisons are all relative rating methods. Rank-order
or pair comparisons could be used to study preferences with
respect to item samples, but both of these are more laborious than the
forced distribution and offer little gain in precision.

Rather than argue whether absolute rating methods are better in general
than relative rating methods or vice versa, it is more fitting to consider
them as alternative techniques for handling different kinds of problems.
Relative ratings make more sense with certain kinds of studies and absolute
methods are more sensible for others. Relative ratings, like the Q-sort,
should be used only when the items are all from some common frame of
reference. To take an extreme example, it would be incorrect to have the
subject sort 100 pictures, half of which are photographs of automobiles
and half of which are photographs of houses. In making the Q-sort the
subject would have to decide not only which houses he likes better and
which automobiles he likes better, but whether he likes automobiles
better than houses. This would make for a very ambiguous rating task.

Relative rating techniques, like the Q-sort, must be used when there is
no concrete basis on which absolute ratings can be made. For example,
no one would seriously consider using absolute ratings in a study of lifted
weights. It would make little sense to have the weights judged on a
seven-point scale ranging from “very heavy” to “very light.” Similarly, in
many studies of preferences and impressions, there is little concrete basis
on which absolute ratings can be made. For example, the individual can
feel quite safe in his choice between steak and lamb chops but have
difficulty in deciding how much he likes either in an absolute sense. Such
preferences are inherently comparative. On the other hand, there are
many studies in which absolute ratings are essential. This is so in most
studies of attitudes. For example, it might be of some interest to learn the
individual's relative feelings about Negroes as opposed to Jews and
Orientals; but the major purpose of such studies is to learn the degree of
favorableness of the individual's attitudes toward different groups.

As a final point of criticism about the Q-sort, there has been some
controversy about the objectivity of such ratings. People who have used the
technique in clinical study, particularly projective testers, have spoken
of the Q-sort as being objective, whereas critics of the method say that it
is subjective. A distinction needs to be made between objectivity and
validity. The Q-sort provides a means of objectifying impressions, but
they are impressions nevertheless and not necessarily correct or valid until
proved so. The important point is that rating techniques like the Q-sort
provide the first step in studying the validity of impressions.

Illustrative Studies. The Q-sort rating technique is not an adjunct to
any one field of psychology. It is a method for investigating a wide variety
of problems. The following illustrative studies are presented to show the
range of topics that have been studied.

Personality. Nunnally (14) studied the changing self-conceptions of
one subject over a period of two years' time. The item sample was obtained
from a three-months prestudy of the individual. Items were distilled
from protracted discussions with the subject and with family and
friends. Additional items were suggested by projective and other test
results. The 60 items used in the Q-sort were meant to cover the
range of the things that the subject says about herself or might say under
appropriate conditions. Some of the statements are as follows:

a. I sometimes think of myself as being neglected and unloved.
b. I feel pleasantly exhilarated when all eyes are on me.
c. I sometimes feel that I could run out of the room, scream,
   or burst into tears.
d. When at a party I can get quite drunk to escape the monotony.

The subject used the item sample to describe herself, rating the items
from “most characteristic of myself” to “least characteristic.” She made
15 separate Q-sorts to describe her behavior in different social situations.
A factor analysis showed three factors, representing three different modes
of behavior. Predictions were then made about how the individual would
sort and how the three factors would change after psychotherapy. The
main purpose of the experiment was to demonstrate ways of studying
self-conception.

Communications. Swanson¹ carried out a series of Q-sort studies
concerning the readership of magazines. In one study he was interested in
the effect of title pages on the readership of different articles. One
hundred title pages were sampled from the issues of a particular magazine
over several years. These were placed individually on squares
and used as the items for Q-sort ratings of preference. The responses of
20 subjects were factor analyzed. The four factors which were obtained
reflected interestingly on why different people read different kinds of
articles. Illustrating the type of results which he obtained, one of the
factors is a clear measure of sensationalism. The title pages relating to
the factor contain pictures of corpses, daggers, guns, voluptuous women,
and the like. The titles themselves contained terms like B-girl,
bloodhound, murder, FBI, and sex.
Psychotherapy. The field of psychotherapy abounds with different
theories about the therapeutic process and about methods of treatment.
Fiedler (6) performed a Q-sort study of the views of therapists
about the therapeutic process. The item sample consisted of statements
concerning a wide variety of ways in which the therapist can interact
with the client. The therapists included 20 each from the psychoanalytic,
nondirective, and Adlerian schools. Ten from each school
were expert psychotherapists and ten were nonexpert
therapists. Each therapist was asked to sort the item sample in
such a way as to describe his behavior in therapeutic situations.
Correlational analyses showed that experts in the three schools of thought
gave Q-sort descriptions which were more alike than were the descriptions
of experts and nonexperts within any one school. The results suggest
that therapists become more alike in their practice with continued
experience and that differences between schools of psychotherapy are not as
marked as they would seem.

¹ Personal communication from C. E. Swanson.

Esthetics. Stephenson (23, pp. 128-144) used the Q-sort technique to
study the esthetic values of art students. The item sample consisted of
120 cards containing various arrangements of colored rectangles. The
rectangles were arranged in such a way as to be either overlapping or
nonoverlapping and to form either regular or irregular designs. The
hypothesis was that experienced artists would prefer irregular and
overlapping designs and that artistically untrained persons would not prefer
that combination. Stephenson's interest in the problem was largely
methodological, to show how the Q-sort could be used in studies of esthetics.
A small-scale investigation with 18 subjects tended to uphold the hypothesis.


THE SEMANTIC DIFFERENTIAL 
The Semantic Differential rating instrument is most prominently identified
with the work of C. E. Osgood (see 19) and his associates. It grew
out of the need for a measure of “meaning,” for a way of indexing the
manifold implications of objects, persons, and ideas for the human
observer. Perhaps no other construct is as inextricably interwoven with
psychological and philosophical theory as is meaning. It is axiomatic that
human behavior is determined by the meaning of events rather than by
intrinsic properties of events. The baby reacts approvingly to its mother's
voice because that has acquired the meaning of nourishment, warmth,
and protection. A soft buzzing sound will strike fear in the experienced
outdoorsman for whom the meaning is rattlesnake and danger. On a more
complex social level, different people harbor different meanings of
concepts like Republican Party, Jew, Protestant Church, Cadillac, college,
and even chop suey.

The fundamental hypothesis underlying the Semantic Differential is
that certain important components of meaning can be measured by the
rating of objects or ideas in respect to bipolar adjectives. Bipolar terms
like good versus bad, strong versus weak, and intelligent versus ignorant
are continua along which everyday meanings are expressed. On a
metaphorical level, we may, for example, speak of a person being cold rather
than warm. The implication is that the cold individual is an unpleasant
person rather than that his bodily temperature is below normal. The
American culture as well as a number of other cultures studied to date
(see 19, pp. 170-176) have a common metaphorical usage of a wide
variety of bipolar terms. It seems to be almost universal that “good” things
imply white instead of black, high instead of low, and full instead of
empty.

Each set of bipolar adjectives is called a scale on the Semantic
Differential. The number of scales in most studies ranges from a half-dozen
to more than thirty. The customary approach is to use a seven-step
continuum for each set of bipolar adjectives. The thing to be rated, the
concept, is placed at the head of a page above the bipolar scales. An
abbreviated form is shown in Figure 17-1. A list of scales is presented in
Appendix 8 which contains the bipolar adjectives which have been most
widely used in studies to date. The same set of scales is used to rate all of
the concepts in a particular study. The pages of the rating form differ only
in terms of the concept at the top of each sheet.



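FIGURE 17-1. A Semantic Differential form for rating the concept Napoleon.

                          Napoleon

     Good      ___ ___ ___ ___ ___ ___ ___   Bad
     Weak      ___ ___ ___ ___ ___ ___ ___   Strong
     Happy     ___ ___ ___ ___ ___ ___ ___   Sad
     Ignorant  ___ ___ ___ ___ ___ ___ ___   Intelligent
     Fast      ___ ___ ___ ___ ___ ___ ___   Slow
     Heavy     ___ ___ ___ ___ ___ ___ ___   Light
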
Multidimensional Meaning. Basic to the use of the Semantic Differential
is the hypothesis that meaning is multidimensional, that there are a
number of facets of meaning rather than only one continuum along which
things are ordered. At the lowest level, each scale can be taken as a
separate facet of meaning. Two concepts can be said to have similar
meanings if they are given nearly identical scores on a wide variety of
scales. If two concepts have similar scores on some of the scales and not
on others, this provides a way of charting the ways in which the meanings
are similar and different. But if each scale must be considered a
separate facet of meaning, this provides little information about the
nature of meaning and also makes the results of any study dependent
on the particular scales which are used. Fortunately, bipolar adjectives like
those used in the Semantic Differential resolve into a relatively small
number of common factors. Factor-analytic studies by Osgood and his
colleagues (18) and by others of a wide variety of scales applied to a wide
variety of concepts have agreed on the following factors:

1. The Evaluative Factor, most prominently identified by scales like
good-bad, valuable-worthless, kind-cruel, bitter-sweet, fair-unfair,
and pleasant-unpleasant.
2. The Potency Factor, most prominently identified by strong-weak,
large-small, heavy-light, and thick-thin.
3. The Activity Factor, most prominently identified by active-passive,
fast-slow, hot-cold, and sharp-dull.
The three factors above are ordered in terms of the variance that they
account for in the scales which have been studied to date. The evaluative
factor is by far the most prominent of the three. It appears to measure
“attitude” much like that which is tested by conventional attitude scales.
Most scales have at least moderate loadings on this factor and many have
loadings well above .80. Some of the scales are mixtures of the factors
rather than measures of one factor only. For example, hard-soft is a
mixture of evaluation and potency, and relaxed-tense is a mixture of
evaluation and activity.

The three factors described above are useful in classifying scales. Thus,
rather than include a large number of scales which are known to be
highly evaluative in a rating form, only a smaller number of the most
highly loaded scales need be used. Similarly, a few scales can be included
to represent potency and activity. This will provide ample room for
including additional scales that have particular relevance to the concepts
being studied.

By averaging scores on the most highly loaded scales for each factor,
an approximate factor score can be obtained for each individual. Exact
factor scores require more complex statistical treatments (cf. 27,
chap. 15). For example, if an individual rates the concept labor union on
four of the purely evaluative scales at positions 4, 5, 6, and 5 respectively,
the average of these, 5, can be taken as his over-all evaluation of labor
union. In a similar way, factor scores on the potency and activity factors
can be found for labor union or any other concept in the study.
The resulting scores can be pictured geometrically in what is called a
semantic space. Each factor is used as an axis, resulting in a
three-dimensional scattering of points. The position of each point in the
space and the proximity of points to one another can be used to interpret
the meaning of particular concepts either for one individual separately or
for the average responses of a group. A two-dimensional example is shown
in Figure 17-2.

FIGURE 17-2. The scores for one person on the evaluative and potency factors
for the concepts God, My Father, Satan, and Beggar.

[Plot: vertical axis, Evaluation (low to high); horizontal axis, Potency
(low to high). God and My Father fall in the upper half of the plot (high
evaluation); Beggar and Satan fall in the lower half (low evaluation).]
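
The averaging of scale ratings into approximate factor scores, and the
placing of concepts in a semantic space, can be sketched in Python as
follows. This is an illustration only. The grouping of scales uses a few of
the scales named earlier; the function names, the convention that position 7
lies at the first-named pole of each scale, and all ratings other than the
evaluative ratings 4, 5, 6, and 5 for labor union are assumptions made up
for the example.

```python
from math import dist  # Euclidean distance (Python 3.8 or later)

# A few of the scales named in the text, grouped by factor.
FACTOR_SCALES = {
    "evaluation": ["good-bad", "valuable-worthless", "kind-cruel",
                   "pleasant-unpleasant"],
    "potency": ["strong-weak", "large-small", "heavy-light"],
    "activity": ["active-passive", "fast-slow", "sharp-dull"],
}


def factor_scores(ratings):
    """Average the ratings (positions 1 to 7) on the scales for each factor."""
    return {
        factor: sum(ratings[scale] for scale in scales) / len(scales)
        for factor, scales in FACTOR_SCALES.items()
    }


def semantic_distance(ratings_a, ratings_b):
    """Distance between two concepts in the three-dimensional semantic space."""
    a, b = factor_scores(ratings_a), factor_scores(ratings_b)
    return dist([a[f] for f in FACTOR_SCALES], [b[f] for f in FACTOR_SCALES])


# One person's ratings of the concept labor union. Only the four evaluative
# ratings come from the text; the rest are invented for the illustration.
labor_union = {
    "good-bad": 4, "valuable-worthless": 5, "kind-cruel": 6,
    "pleasant-unpleasant": 5,
    "strong-weak": 6, "large-small": 5, "heavy-light": 7,
    "active-passive": 3, "fast-slow": 4, "sharp-dull": 5,
}
# A second, entirely hypothetical concept that differs only in evaluation.
other_concept = dict(labor_union, **{
    "good-bad": 2, "valuable-worthless": 2, "kind-cruel": 2,
    "pleasant-unpleasant": 2,
})

scores = factor_scores(labor_union)
gap = semantic_distance(labor_union, other_concept)
```

Concepts with a small distance between them would lie close together in a
plot like Figure 17-2, and concepts with a large distance would lie far apart.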

Illustrative Studies. A great deal of empirical work has been undertaken
in recent years with the Semantic Differential (see 19), ranging
from studies of dream symbolism to cross-cultural studies of pe xn 
The following studies were selected to illustrate the wide variety of
applications for the measuring instrument.

Psychotherapy. Because of the traditional interpretation of psychotherapy
as a way of changing attitudes toward or meanings about oneself
and the surrounding human environment, the Semantic Differential
has a welcome place in understanding the psychotherapeutic process. One
of the most interesting studies of this kind concerned the now famous case
of triple personality first reported by Thigpen and Cleckley (26). They
made a protracted study of a psychotherapy patient who periodically
displayed such marked changes in attitudes, gestures, voice quality, and
general behavior as to suggest an almost complete movement from one
personality to another. The subject's customary “personality” was given
the pseudonym of Jane, and the less frequent alternative modes of
behavior were referred to as Eve White and Eve Black respectively.
Thigpen and Cleckley followed the case until psychotherapy “killed off”
the two Eves, leaving only the behavior typical of Jane.

The Semantic Differential entered the study as an independent measure 
of the different “personalities” and as a means of charting the changes 
brought by psychotherapy. A Semantic Differential form concerning con- 
cepts in the patient's life was administered to each of the three “person- 
alities” as they occurred in different psychotherapy sessions. The same 
three measures were taken later in therapy, and follow-up measures were 
taken over the subsequent months. With little knowledge of the case 
other than that there were three “personalities,” Osgood and Luria (17) 
attempted a blind analysis from the Semantic Differential data. They were
not only able to distinguish the three “personalities” and to provide
considerable insight into the behaviors involved in each, but they also made
some correct predictions about the final personality which would be
evidenced in the ratings.


Developmental Psychology. Following the hypothesis that meanings are
learned and not inborn in the child, there should be an increasing tendency
for children to agree with one another about meanings as they grow
up. For example, four-year-old children are likely to hold different
meanings about the word lion or a picture of the animal. One child will think
the lion is a pretty pussycat and another child will consider the animal
as an immediate danger, lurking in the cellar. Older children and adults
tend to agree with one another that the lion is powerful, dangerous,
and remote from the immediate environment.

One study (see 19, p. 989) used a Semantic Differential form to study
the tendency toward agreement in meanings of common objects as a
function of age. Using a standard set of concepts and scales, subjects
were tested at four age levels ranging from the first grade to the college
level. It was found, as predicted, that there is an increasing tendency
toward agreement among subjects going from younger to older ages.
Learning Theory. A number of studies have been undertaken to relate
responses on the Semantic Differential to typical measures of learning.
Osgood and his colleagues (see 19, pp. 155-159) studied
the relationship between latency, how long the subject takes to respond,
and the extremeness of ratings on Semantic Differential scales. They found
that if a subject marks a concept on the extremes of a scale, on the 1
and 2 or 6 and 7 positions, he makes up his mind more quickly than if the
mark is made toward the center of the scale.

Staats and Staats (22) have recently used the Semantic Differential in
an experiment of the classical conditioning type. They simultaneously
presented names of countries along with either positively or negatively
evaluative words. Subjects then rated each of the countries on the scale
pleasant-unpleasant. Those who had received positive evaluation terms in
conjunction with the name of a country rated the country as more
pleasant than subjects who had received negative evaluation terms in
conjunction with the country. This suggests, quite interestingly, that
meanings as measured by the Semantic Differential can be conditioned
like other stimulus-response connections.

Public Attitudes. The Semantic Differential lends itself readily to
studies of public reaction to social issues, minority groups, and commercial
products. Adherents of the technique feel that more is learned from
the many scales on the Semantic Differential than from the one over-all
score obtained with most conventional attitude measures. Nunnally² used
the Semantic Differential to study public attitudes toward mental
disorders and toward various social and family roles. The form was
administered to 200 members of an opinion panel, a group of individuals
chosen in such a way as to resemble the whole United States population
in terms of a number of demographic characteristics. Scale profiles for
three of the concepts are shown in Figure 17-3. It was found that
individuals with a mental disorder are viewed as relatively bad and
inactive, in comparison to normal people. The scale which best differentiates
public attitudes toward mental disorders from attitudes toward normals
is tenseness. Evidently “nervousness,” or anxiety, is the cardinal sign
of mental disorder in public thinking. Psychotic disorders are
differentiated from neurotic disorders by being rated more strong, less
predictable, and more dangerous.

² Unpublished research.

FIGURE 17-3. Semantic Differential profiles for the concepts Me, Old Man,
and Neurotic Man. Each point represents the mean rating of 200 subjects.

[Profile chart: one line for each of the three concepts, plotted across
the following scales.]

     Bad             Good
     Worthless       Valuable
     Dirty           Clean
     Insincere       Sincere
     Foolish         Wise
     Ignorant        Intelligent
     Dangerous       Safe
     Sad             Happy
     Poor            Rich
     Unpredictable   Predictable
     Tense           Relaxed
     Sick            Healthy
     Weak            Strong
     Delicate        Rugged
     Passive         Active
     Slow            Fast
     Cold            Warm



Evaluation of the Semantic Differential. One of the major questions asked
about the Semantic Differential is, "Does it measure meaning?" The
concepts father and mother receive similar ratings by most subjects, with
some consistent differences. Does this imply that father and mother mean
the same thing? Not as meaning is ordinarily construed. But in one sense
they are similar, both providing love, guidance, security, and to some
extent also discipline. It is best said that the Semantic Differential
measures connotative rather than denotative meaning, that is, something
other than the perceptual reaction to objects. Another way of explaining
this is to state that the Semantic Differential is concerned with
metaphorical meaning.

One might wonder whether sensible ratings can be made with some of the
scales that are applied to particular concepts. For example, what sense
does it make for an individual to rate the Empire State Building on the
scale [...] or on the scale friendly-unfriendly? There is no concrete way
in which subjects can be told the basis for making such ratings, but
practical experience has shown that they can do it anyway. A very detailed
set of reports on reliability (cf. 19, chap. 3) shows high reliabilities
for factor scores and even surprisingly high reliabilities for individual
scales.

The difficulty of validating an instrument is exemplified in its purest
form by the Semantic Differential technique. There is little else with
which to compare connotative meaning and certainly no dependable index
with which Semantic Differential scores can be compared. In cases of this
kind the only appeal is to construct validity as it was discussed
previously. The many studies that are being undertaken with the instrument
(see 19) and the many relationships which are being shown with other
variables are gradually demonstrating what the instrument can and cannot
measure. Considering the wide use of the instrument and the research
results to date, the Semantic Differential gives real promise as a measure
of connotative meaning.


SIMILARITY JUDGMENTS

A number of rating techniques (see 12, pp. 246-251; 13) are available
which depend entirely on judgments of similarity and differences among
persons, objects, and concepts. The most widely used procedure of this
kind is the method of triads. In a typical problem the investigator
studies the relationships among a dozen stimuli. The stimuli can be
anything from names of persons to geometrical forms. The stimuli are
presented to the subject three at a time and he is asked to judge which
two are more similar. Next, the subject is asked which two of the three
stimuli are most different. The subject is presented either with all
possible triads of the stimuli or selectively with as many triads as are
necessary so that each stimulus appears at least once with all others.

The analysis of similarity judgments depends on the percentage of times
that objects are judged to be similar to or different from one another.
The percentages can be obtained either by repeating the same set of triads
many times with the same person, or more typically, percentages are
calculated over the responses of a group of subjects. Mathematical
procedures are available (see 1, 13, 21, 28) for locating each of the
experimental stimuli on the continua which underlie the judgments. A
number of other methods of similarity judgment (see 12, pp. 246-251)
provide much the same results as the method of triads.

Perhaps the best illustration of similarity judgments can be made with
a type of study which, to the author's knowledge, has not yet been
undertaken. The problem is to learn how a group of industrialists in a
particular city judge members of the United States Senate. Twenty senators
are selected in such a way as to cut broadly across what are thought to be
dimensions of political and governmental philosophy. The name of each
senator is typed separately on a file card. One hundred industrialists are
used as subjects. Each is required to make similarity judgments by the
method of triads. The responses of all subjects are placed together in a
table, showing the per cent of times that senator A is judged to be more
similar to B than to C, and so on.

The data can be used to deduce a number of things about how the
industrialists view the senators. Primarily it can be determined which
senators are viewed as being more similar in their political behavior.
Secondly, an analysis will indicate the number of underlying dimensions
along which the senators are judged. For example, it might be apparent
that one underlying continuum concerns foreign policy, a second concerns
fiscal policy in the United States, and a third concerns social welfare
legislation. The analysis would provide a means of locating each of the 20
senators on the three dimensions. In general, the study should provide a
better understanding of how senators are judged by the particular group of
subjects.

Evaluation of Similarity Judgments. One obvious feature of similarity
judgments is that they are very laborious. If there are 20 stimuli as in
the example above, there are 1,140 possible triads. If the subject makes
even a fraction of these judgments, the gathering of data is time
consuming. The subsequent task of statistical analysis is difficult and
still somewhat controversial.

It may help to understand the method of similarity judgments by comparing
it with the Q-sort and Semantic Differential. The Q-sort could be applied
to the hypothetical problem above by having each industrialist describe
each senator with, say, a sample of 50 items. The items would be such as
"favors more financial aid to England," "supports high tariffs," and
"wants more federal aid to education." The Semantic Differential would use
each of the senators as a concept with scales like strong-weak,
liberal-conservative, wise-foolish, kind-cruel, and safe-dangerous.
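
The bookkeeping behind the method of triads can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the
senator labels, the function names, and the data layout are hypothetical,
and the tally reports the per cent of presentations in which a pair was
picked as the most similar pair, which is one simple reading of the
percentages described in the text.

```python
from collections import Counter
from itertools import combinations
from math import comb


def all_triads(stimuli):
    """Return every unordered set of three stimuli."""
    return list(combinations(stimuli, 3))


def similarity_percentages(judgments):
    """Per cent of presentations in which each pair was judged most similar.

    `judgments` is an iterable of (triad, chosen_pair) tuples, one per
    judgment (a hypothetical layout chosen for this sketch).
    """
    chosen = Counter()
    presented = Counter()
    for triad, pair in judgments:
        # Every pair inside the triad had a chance of being chosen.
        for candidate in combinations(sorted(triad), 2):
            presented[candidate] += 1
        chosen[tuple(sorted(pair))] += 1
    return {pair: 100.0 * chosen[pair] / n for pair, n in presented.items()}


senators = [f"S{i}" for i in range(1, 21)]  # hypothetical labels
triads = all_triads(senators)
print(len(triads), comb(20, 3))             # both are 1140
```

The count of 1,140 is simply the number of ways of choosing 3 stimuli from
20, which is why the labor grows so quickly as stimuli are added.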

The judging of objects in terms of similarity is both an advantage and a
disadvantage. Similarity is a primitive idea that is difficult to explain
to subjects, and it is easy to imagine experiments in which judgments of
similarity would be hard to make. The use of similarity judgments also
provides little information for interpreting the results. It can be
determined that certain objects or concepts are judged as similar, but an
interpretation must be made of the underlying characteristic which makes
for similarity. In both the Q-sort and the Semantic Differential the
respective items and scales indicate more directly the way in which things
are viewed as similar. On the other hand there are advantages to the
similarity concept in ratings. It is not dependent on any set of items or
scales, and thus the results often have a broader generality than those
found with the Q-sort or Semantic Differential. Similarity judgments have
to this time and will likely continue to have the most value in dealing
with stimuli for which items or scales are very difficult to devise. The
major studies to date have largely dealt with perceptual objects,
geometrical forms, colors, and stimuli of that kind.


TECHNIQUES FOR STUDYING INTERPERSONAL RELATIONS 
Much of psychology and particularly that branch called social psychology
is concerned with the relationships between people and the ways that
people behave in group situations. Because of the complexity of human
interactions and the difficulties of measuring interpersonal responses,
studies in this area have lagged behind most other branches of
psychological investigation. This section will consider some of the more
promising inroads into the measurement of group and interpersonal
behavior.

Sociometry. One of the earliest and most widely used methods for
understanding relationships within a group is to study the choices of
group members for participation with one another. This can be illustrated
with a hypothetical situation in which a class is being divided into
subgroups for work on a class project. One approach would be to ask each
of the students to note, in order of preference, several persons with whom
he would rather work. The choices can then be plotted as a diagram, or a
sociogram as it is often called. This is done by drawing arrows from
person to person showing the choices. Figure 17-4 illustrates a typical
situation in which each student is asked to give a first and second choice
for a work partner.

The sociogram contains no new information over what is known from the
choices of individuals for one another, but it is a convenient means of
picturing and understanding the groupings of people. Looking at the
sociogram in Figure 17-4, we find that persons 5, 8, 11, and 12 form a 
group, or clique. All of the first choices and all but one of the second 
choices stay within the group. Person 12 is especially popular in the 
group, receiving the first-choice nominations of the other three members. 
Persons 1, 6, 7, and 9 form a second group. Persons 2, 4, and 10 form a 
completely closed group, in which all choices remain among the three 


persons. Person 3 is an isolate, and for that reason might have difficulty
in working with any of the individuals. Persons 6 and 11 serve as
intermediaries between their respective groups. Their relationship might
be used to get better communication between the two groups or help in
getting them together on mutual projects. Relationships of the kinds
shown in the sociogram would prove useful in dividing the class into
subgroups, either placing individuals into the natural groups which they
form or improving relationships between certain individuals to allow new
groupings of the students.

FIGURE 17-4. Sociogram showing the choices for work partners in a group
of 12 students. (Legend: one arrow style marks first choices and a second
arrow style marks second choices; the diagram itself is not reproduced.)

The sociometric technique can be applied to a wide variety of interactions
including systems of leadership and communications. For example, in a
study of military leadership, it would be expected that a hierarchy would
be formed in which each individual receives his orders from someone above
him, leading to the officer who is in over-all command.

In addition to the simple pictographic inspection of the sociogram, there
are a number of indices that can be applied to sociometric data. In the
case where each individual makes only one choice, an estimate of an
individual's popularity in the group can be obtained as follows:

\[ \text{Choice status of } P_i = \frac{\text{no. of persons choosing } P_i}{N - 1} \]

where P_i stands for person i
      N stands for the number of persons in the group

Because the individual is not allowed to choose himself, the denominator
of the fraction is N - 1 rather than N. If individuals are allowed to
choose as many persons in the group as they like, an index of over-all
liking of the group members for one another, or as it has been called
"positive expansiveness," is obtained as follows:

\[ \text{Positive expansiveness of } P_i = \frac{\text{no. of choices } P_i \text{ makes}}{N - 1} \]

Other indices are also available (see 20) for describing the group as a
whole. The following formula gives an indication of the extent to which
all of the group members are incorporated, or integrated, into the group:

\[ \text{Group integration} = \frac{1}{\text{no. of isolates (individuals receiving no choices)}} \]

The most important use of sociometry and sociometric analysis is in
studying experimental changes in group structures. For example, it might
be hypothesized that a certain type of leadership will alter the
interrelationships of group members in a particular manner. An experiment
is undertaken in which the special type of leadership is instituted.
Sociometric analysis is made before and after the experiment to measure
changes in group structure.
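
The three sociometric indices given earlier in this section (choice
status, positive expansiveness, and group integration) can be computed
directly from a record of who chooses whom. The following Python sketch is
illustrative only: the five-person group is invented, the function names
are hypothetical, and returning None when a group has no isolates is an
assumption made here, since the integration formula is undefined in that
case.

```python
def choice_status(choices, person):
    """No. of persons choosing `person`, divided by N - 1."""
    n = len(choices)
    received = sum(
        1 for chooser, chosen in choices.items()
        if chooser != person and person in chosen
    )
    return received / (n - 1)


def positive_expansiveness(choices, person):
    """No. of choices `person` makes, divided by N - 1."""
    return len(choices[person]) / (len(choices) - 1)


def isolates(choices):
    """Members who receive no choices from anyone else."""
    chosen = set()
    for chooser, picks in choices.items():
        chosen.update(p for p in picks if p != chooser)
    return [p for p in choices if p not in chosen]


def group_integration(choices):
    """1 / no. of isolates; None when there are no isolates (assumption)."""
    k = len(isolates(choices))
    return 1 / k if k else None


# Hypothetical five-person group: each key chooses the members in its set.
choices = {
    "A": {"B"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "B"},
    "E": {"D"},
}
print(choice_status(choices, "A"))           # 0.75
print(positive_expansiveness(choices, "D"))  # 0.5
print(isolates(choices))                     # ['C', 'E']
print(group_integration(choices))            # 0.5
```

Running the same functions on choices gathered before and after an
experimental change would give one simple way of comparing group
structure at the two times.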

Interpersonal Perception. It is a truism that much of human social
behavior is determined by how the individual regards other people, how
well he understands the feelings of others, and the contrast between what
he thinks of himself and of other individuals. Empathy, or the ability to
understand others, is a variable that has been studied extensively in this
regard. In the psychotherapy situation it would be expected that the
effective therapist could understand well how the patient thinks and
feels. It might also be expected that adjustment in marriage would depend
in part on how well the individuals understand each other's thoughts and
feelings. A suggested measure of empathy is obtained by having an
individual judge how another person will describe himself.
This can be done with interest tests, self-report inventories, the Q-sort, 
the Semantic Differential, or other methods of personality description. 
The judge fills out the form in the way that he thinks the person to be 
predicted will fill out the form. The judge's marks are then compared with 
the actual ratings of the other person. If the two sets of marks correspond 
closely, it is said that the judge has high empathy, at least for the one 
person being judged. A statistical comparison of the two sets of ratings 
can be made either with the correlation coefficient or, better, with the 
D statistic discussed in Chapter 7. 
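
The comparison of a judge's predictions with the other person's actual
ratings can be sketched as follows. Chapter 7 is not reproduced here, so
the sketch assumes that D is the usual distance measure, the square root
of the sum of squared differences between the two profiles; the ratings
and function names are hypothetical.

```python
from math import sqrt


def d_statistic(x, y):
    """Distance between two rating profiles (assumed form of D)."""
    if len(x) != len(y):
        raise ValueError("rating profiles must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_r(x, y):
    """Product-moment correlation between two rating profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


actual = [5, 5, 3, 7, 1]             # hypothetical self-ratings
predicted = [6, 5, 2, 7, 3]          # the judge's predictions
shifted = [r + 2 for r in actual]    # right shape, every rating 2 too high

print(d_statistic(predicted, actual))  # sqrt(6), about 2.45
print(pearson_r(shifted, actual))      # essentially 1.0
print(d_statistic(shifted, actual))    # sqrt(20), about 4.47
```

The last two lines suggest one reason D might be preferred: a judge who
overestimates every rating by the same amount still correlates almost
perfectly with the actual ratings, whereas D registers the discrepancy in
level.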

In addition to measures of empathy, techniques have been proposed for 
measuring a number of other aspects of interpersonal perception. Fiedler 
(7, 8) has performed extensive investigations with a measure named
"assumed similarity of opposites" (ASO). The subject is asked to rate two
persons with the Semantic Differential: one individual whom he likes or 
with whom he works well and another individual whom he dislikes or 
with whom he works poorly. Fiedler employs the D statistic between the 
two sets of ratings as a measure of ASO. The larger the D, the more dis- 
similar are judged the high-preference and low-preference persons. 

Fiedler (8) found a relationship between the ASO scores of pilots and
the effectiveness of bomber crews and between the ASO scores of tank
commanders and the effectiveness of tank crews. Measures of effectiveness
consisted of expertness in bombing and tank gunnery respectively. Each
leader was asked to predict the self-ratings of an individual with whom
he worked poorly and the self-ratings of an individual with whom he
worked well. Fiedler found that the leaders of effective crews tended to
regard their most and least preferred coworkers as relatively dissimilar.

Measures of interpersonal perception have not yet met with substantial
success. Many different kinds of rating forms and questionnaires have
been used in the studies, and it would not be expected that all would
produce the same results. It is unreasonable to think that an individual's
ability to judge the ratings of another person on, say, an interest
inventory will be the same as the ability to judge another person's
ratings on the Semantic Differential. As is true in all measures of
similarity, it is meaningless to speak of similarity between persons or
between ratings in general. There is a current trend toward the use of
factored tests rather than the use of a miscellaneous collection of items.
This will permit a more concrete statement of the ways in which sets of
ratings are similar and will permit a more direct comparison of the
results obtained by different investigators.


Measures of interpersonal perception meet with a number of statistical
and conceptual difficulties. There has been considerable dispute
regarding the proper statistical procedures for analyzing the research
data. Although the D statistic is presently the most logical and the most
widely used measure, the nature of the statistic makes the results
difficult to interpret in many studies. One of the conceptual difficulties
is that some of the measures of interpersonal perception are so
complicated that even if significant results are found, it is difficult to
understand what they mean. Another difficulty in the measurement of
interpersonal perception is the confusion of accuracy in the understanding
of others with the individual's knowledge of general stereotypes. The
"acceptability influence" was discussed in an earlier chapter, where it
was said that there is a tendency for all persons to give similar and
culturally favored ratings of themselves. Therefore, what may appear to be
accuracy in the understanding of specific persons may be only a knowledge
of cultural stereotypes, a knowledge of how people in general rate
themselves. Current methodological investigations (see 3, 11) are
endeavoring to clean up the ambiguities in this area, and it is hoped that
measures of interpersonal perception will eventually be of substantial
use.


MEASUREMENT TECHNIQUES FOR COMMUNICATIONS STUDIES 
The study of human communications is rapidly becoming a concern of
diverse scientific disciplines. Broadly considered, much of human behavior
can be subsumed under the term communications. The basic ingredients in a
communications system are as follows: a person or device that determines
the message; a set of signals or signs in which the message is encoded; a
method of transmission; and a receiver, either human or mechanical, that
takes in and decodes or gives meaning to the message. The system of events
takes place in ordinary conversation when first an individual has an idea;
the idea is formulated in terms of verbal language; the sounds carry to
the other person's ears; and the human brain acts as the decoding
mechanism.

Each step in the communications process offers many possibilities for
research, and many of the instruments discussed earlier in this chapter
and in previous chapters can be applied. A vocabulary test measures, in
part, the range of language signs in the individual repertoire.
Achievement tests can be used to measure the differential effectiveness of
different ways of encoding and transmitting messages. For example, it
would be useful to know whether college classes conducted via television
promote the same amounts and kinds of learning which are obtained from
ordinary classroom participation. Traditional methods of measuring
attitudes are useful in learning the differential effectiveness of
different messages and different media of communication.

One of the primary stumbling blocks in communications research is the
lack of effective methods for the description of written and spoken
messages. For example, if it is found that a particular type of message is
effective in either conveying information or changing attitudes, the
finding is of no general use unless the message itself is concretely
described. An adequate description must consider some of the important
message characteristics, and only a scanty knowledge is presently
available about the important ways in which messages vary.
"Message anxiety" is a characteristic which is receiving wide attention
in communications studies. A communication is said to have high message
anxiety if it tends to frighten the audience, uses inflammatory language,
and goes into detail about the dire consequences of particular events. At
present, characteristics like message anxiety rest largely on intuitive
grounds. It is difficult to state explicitly when a communication has
message anxiety or any of the other characteristics being used in
communications studies. Even the few characteristics of messages that are
presently available must be documented intuitively on a "more and less"
basis rather than measured directly. It can be expected that further
developments in the field of communications will depend heavily on the
invention of measures of message content. The following sections discuss
some of the measurement methods currently under investigation.

Readability. One of the primary needs is a measure of the simplicity
and ease with which messages can be understood. It is common knowledge
that there are different ways of writing the same set of facts, and that
some ways are more conducive to the reader's understanding than others.
A number of formulas have been developed (see 5, 9) which purport to
measure the readability of written messages. The formulas mainly depend
on counts of the length of words, either in letters or syllables, the
frequency with which words occur, and the length of sentences. The
following are two ways of stating essentially the same idea, the first
low in readability and the second high:

The primary consequence of protracted cerebration on profound and
enigmatic problems, if not interspersed systematically with less involved
quiescent diversions, is a reaction of tedium.

Thinking for a long period of time about difficult problems will make
you feel tired.

Numerous examples have been shown (see 5, 9) in which the formulas
correlate well with intuitive estimates of readability (see Table 17-2).
However, the concept of readability is difficult to define and somewhat
misleading. It should not be assumed that readability as measured by
current counting methods necessarily means good writing. Scientific
articles are necessarily less readable in terms of readability formulas.
In scientific writing, complex words must often be used to convey complex
thoughts. A piece of simple verse may be more readable in terms of the
readability formulas, but it is not likely a better poem than Gray's
"Elegy." A message is readable in terms of the readability formulas if it
is simple and can be understood easily even by persons with relatively
little education. The readability formulas are most important when the
message is intended to reach a wide audience. They would be useful in
preparing messages for the general public on such topics as farming
methods, safety precautions, and the like.
TABLE 17-2. READABILITY SCORES FOR VARIOUS KINDS OF WRITTEN MATERIAL
(From Flesch, 9, p. 6)

Reading      Description       Typical magazine            Syllables per  Av. sentence
ease score   of style                                      100 words      length
90-100       Very easy         Comics                      123             8
80-90        Easy              Pulp fiction                131            11
70-80        Fairly easy       Slick fiction               139            14
60-70        Standard          Digests, Time, mass         147            17
                               nonfiction
50-60        Fairly difficult  Harper's, Atlantic          155            21
30-50        Difficult         Academic, scholarly         167            25
0-30         Very difficult    Scientific, professional    192            29
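
The way in which syllable counts and sentence lengths combine into a
single readability score can be illustrated with Flesch's Reading Ease
formula. The formula is not stated in this chapter; it is supplied here
from Flesch's published work as an assumption of the sketch, and the
function name is hypothetical. The syllable counts and sentence lengths
are those of Table 17-2, with the sentence length for the 50-60 row taken
as 21, Flesch's published value.

```python
def flesch_reading_ease(syllables_per_100_words, avg_sentence_length):
    """Flesch Reading Ease: higher scores mean easier reading."""
    return (206.835
            - 0.846 * syllables_per_100_words
            - 1.015 * avg_sentence_length)


# (style, syllables per 100 words, average sentence length, score band)
ROWS = [
    ("Very easy", 123, 8, (90, 100)),
    ("Easy", 131, 11, (80, 90)),
    ("Fairly easy", 139, 14, (70, 80)),
    ("Standard", 147, 17, (60, 70)),
    ("Fairly difficult", 155, 21, (50, 60)),
    ("Difficult", 167, 25, (30, 50)),
    ("Very difficult", 192, 29, (0, 30)),
]

for style, syllables, sentence_length, (low, high) in ROWS:
    score = flesch_reading_ease(syllables, sentence_length)
    print(f"{style:17s} {score:5.1f}  (band {low}-{high})")
```

Each computed score falls inside the band listed for its row, which is a
useful check that the two counts in the table are the inputs to the score
in the first column.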


Cloze Procedure. The incomplete sentence is an old device for testing
children. Taylor (see 24, 25) has developed a similar procedure for the
study, not of individual differences among subjects, but of differences
among messages. His method begins with the deletion of a certain number
of words from the written material. This is done either by deleting words
randomly or by removing every nth word systematically. (A series of
empirical studies suggests that a good approach is to delete every fifth
word in a message.) The mutilated messages are then given to the group of
subjects. They are told to fill in the blanks as best they can. The score
for each person is the number of correct words placed in the blanks. The
score for the message is the mean number of correct words obtained by the
group of subjects.

Two mutilated messages may serve as examples. The first is a low Cloze
score ("hard") message and the second is a high Cloze score ("easy")
message:

[...] four nuclei, ____ which three degenerate and ____ fourth divides
to produce stationary and migratory nuclei. ____ the exchange of nuclei,
____ exchanged nuclei fuse ____ divide to form a ____ and a large nucleus.

and

The boy picked apples ____ a tree in the ____. He gave them to ____
farmer's wife. She baked ____ apple pie for the ____. He ate two big
____. Then the boy helped ____ farmer milk the cow.
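
The mechanics of the Cloze procedure, systematic deletion followed by
scoring of the filled-in blanks, can be sketched as below. The sketch
assumes deletion of every fifth word, exact matching of responses, and a
passage whose deleted words ("from," "orchard," "the") are guesses made
for illustration; none of the function names come from Taylor's work.

```python
def make_cloze(text, n=5, blank="____"):
    """Delete every nth word; return the mutilated text and the answers."""
    mutilated, answers = [], []
    for position, word in enumerate(text.split(), start=1):
        if position % n == 0:
            core = word.rstrip(".,;:!?")         # keep trailing punctuation
            answers.append(core.lower())
            mutilated.append(blank + word[len(core):])
        else:
            mutilated.append(word)
    return " ".join(mutilated), answers


def person_score(answers, responses):
    """Number of blanks one subject filled with the exact deleted word."""
    return sum(a == r.strip().lower() for a, r in zip(answers, responses))


def message_score(answers, all_responses):
    """Mean number of correct words over the group of subjects."""
    scores = [person_score(answers, r) for r in all_responses]
    return sum(scores) / len(scores)


text = ("The boy picked apples from a tree in the orchard. "
        "He gave them to the farmer's wife.")
mutilated, answers = make_cloze(text)
print(mutilated)
print(answers)                       # ['from', 'orchard', 'the']

group = [["from", "yard", "the"], ["off", "orchard", "the"]]
print(message_score(answers, group))  # 2.0
```

A message on which nearly every subject recovers nearly every deleted
word would receive a high score and would count as "easy" in the sense
used above.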


Taylor considers the Cloze procedure to be a new approach to the
measurement of readability, which he regards as having certain advantages
over older methods. Suci and Nunnally (unpublished research) found
correlations of .42 and .48 of Cloze scores with Flesch (9) readability
scores and Dale-Chall (5) readability scores respectively over a diverse
sample of 70 messages. This shows that Cloze is related to conventional
readability formulas; but the correlations are far enough from perfect to
raise the question as to what additional variable or variables are being
measured by the Cloze procedure.

Regardless of whether or not Cloze procedure proves to be a useful
approach to the measurement of readability, it has interesting properties
in its own right. Cloze seems to measure the psychological redundancy,
or repetitiousness, in written messages. If everyone can correctly fill in
a deleted word, the word is completely redundant in the message. This is a
way of saying that the particular word supplies no new information over
the rest of the message. Cloze procedure and the concept of redundancy
can be applied to auditory and visual patterns as well as to written
material. Notes or melodic phrases can be either systematically deleted or
deleted on a random basis from musical selections. Subjects can be asked
to "fill in," either hum or play, the missing parts.

Taylor (personal communication) is presently applying Cloze procedure to
paintings. He sections the painting into a large number of square areas.
A fraction of the areas are removed on a random basis. Subjects are asked
to fill in the missing forms and colors. By this method he hopes to find
systematic differences among different styles of painting.

Redundancy and its measurement by Cloze procedure might eventually prove
to be an interesting psychological variable. In some forms of written
material, redundancy is a desirable characteristic. This is so when the
purpose of the message is to "hammer in" a few important facts or
instructions. The interestingness of written messages, paintings, and
musical compositions might be dependent on optimal levels of redundancy.
That is, it may be that a certain amount of redundancy is necessary to
make a stimulus pattern pleasing. A musical composition in which each
succeeding note and pattern of notes is completely unpredictable from
previous notes sounds chaotic. On the other hand, a composition so
repetitious that each succeeding note and phrase can be anticipated with
very high accuracy proves to be dull and uninteresting to the
sophisticated ear.

Content Analysis. A simple way of describing the communication process
is to say that it concerns how who says what to whom. Who and whom are
the source and audience in the communication process. The readability
formulas and Cloze procedure are concerned with how the message (what) is
phrased. Although the term "content analysis" is often used to speak in
general about methods of analyzing the properties of messages, it will be
used more specifically here to refer to the information or meanings which
the message embodies.

Content analysis came into prominence during the first twenty years of
this century as a means of studying the content of American newspapers.
Journalists were interested in the subject-matter coverages which
provoked the most public interest. Sociologists were interested in the
influence of newspaper content on the changing political and legal forces
on the American scene. Content analyses have since numbered in the
hundreds (see 2) and have varied from the analysis of Shakespearian plays
to the analysis of propaganda during the Second World War.


TABLE 17-3. CONTENT ANALYSIS OF NEWSPAPER TREATMENT OF A MURDER TRIAL
(Adapted from Berelson, 2, p. 41)

[Table values not recoverable. Columns: Morning Sun, News-Post, Evening
Sun, Afro-American, each in per cent. Rows: material helpful to James's
case, destructive to James's case, and neutral.]

The results of a typical content analysis are shown in Table 17-3. The
analysis concerned the differential treatment by four newspapers of
James, a Negro accused of murder, before his trial. The content of the
four papers was searched for material relevant to the case. Judgments were
made about the relative favorableness or unfavorableness of each article.
The numbers of "helpful," "destructive," and "neutral" materials were then
transformed to percentages. The results show, among other things, that the
Morning Sun had more space devoted to neutral content but nothing that
could be construed as "helpful" to the accused. The Afro-American was the
most one-sided of the four papers, the content being predominantly
"helpful" to the accused.
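
The step of transforming coded materials to percentages is simple enough
to sketch. In the Python below, the category labels follow the text, but
the counts are invented for illustration and are not Berelson's data; the
function name is likewise hypothetical.

```python
from collections import Counter


def category_percentages(codes):
    """Convert a list of coder judgments into per cent per category."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical codings of 20 articles from one newspaper.
codes = ["helpful"] * 3 + ["destructive"] * 5 + ["neutral"] * 12
print(category_percentages(codes))
# {'helpful': 15.0, 'destructive': 25.0, 'neutral': 60.0}
```

Repeating the tally for each newspaper yields one column of the kind of
table described above, so that papers of very different size can be
compared on the same footing.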

Table 17-4 shows an analysis of the literature of the Boy Scouts of 
America and of Hitler Youth. The content was scored in terms of the ends 
that were recommended to the youths. The emphasis in Hitler Youth was 
on the development of the individual as a member of a national com- 
munity, whereas in the Boy Scouts the emphasis was on improving the 
individual as a person. 


TABLE 17-4. CONTENT ANALYSIS RESULTS OF THE GOALS RECOMMENDED IN THE
LITERATURE OF HITLER YOUTH AND THE BOY SCOUTS OF AMERICA
(Adapted from Berelson, 2, p. 38)

Justification                            Hitler Youth,   Boy Scouts,
                                         per cent        per cent
As a member of national community        66              25
As a member of face-to-face group        19              28
Essential for individual's own
  satisfaction or perfection             15              47

A third illustrative content analysis is shown in Table 17-5. The prob- 
lem was to determine the amount of cooperation between the propaganda 
agencies of Italy and Germany. Content analyses were undertaken of the 
replies made by Radio Berlin and Radio Rome to a speech made by 
Roosevelt. Counts were made of the number of references to each coun- 
try, to the collaborating country, and to the joint effort of the two 
countries. The content of the two radio stations proved to be so different 
regarding self-references as to suggest strongly that little cooperation 
went on between the propaganda efforts of the two countries. 


TABLE 17-5. CONTENT ANALYSIS OF PROPAGANDA REFERENCES BY RADIO BERLIN
AND RADIO ROME IN REPLY TO THE ROOSEVELT SPEECH OF SEPTEMBER 11, 1941

(From Berelson, 2, p. 73)

Self-references in reply      Radio Berlin,   Radio Rome,
to Roosevelt speech           per cent        per cent
Germany                           96              22
Italy                              0               1
Axis                               4              77

Measurement methods for content analysis have been dominated by
counting techniques. The experimenter seeks something to count that will
have a bearing on a more general issue. In each of the three examples 
above, something was counted as a way of getting at a broader problem. 
In some studies the simple counting of things supplies all the information 
that is needed, such as would be the case in learning the different amounts 
of newspaper space or radio time devoted to news, sports, drama, and 
advertisement. In other cases, the issue being studied is much more com- 
plex, and the units which are counted are only tangentially related to 
what will be inferred from the results. 
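The counting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `category_percentages` and the coded references are hypothetical and are not data from any of the studies cited. The sketch tallies coded units by category and converts the tallies to whole per cents, the form in which the tables above are reported.

```python
from collections import Counter


def category_percentages(coded_units):
    """Return {category: per cent of all coded units}, rounded to whole per cents."""
    counts = Counter(coded_units)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no coded units to tabulate")
    return {category: round(100 * n / total) for category, n in counts.items()}


# Hypothetical codings of the references in one broadcast (assumed for illustration).
references = ["own country"] * 48 + ["joint effort"] * 2
percentages = category_percentages(references)
print(percentages)
```

Because each category is rounded separately, the per cents in a column may occasionally sum to 99 or 101 rather than 100.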

After a decision is made as to what is to be counted or observed in
message content, a number of steps must be undertaken to effect a
systematic content analysis. If the content analysis is to describe the
treatment of a topic in a broad range of messages rather than in
particular messages, a sampling of messages and message sources must be
undertaken. If, for example, the problem is to study the different amounts
of time given to different subject-matter categories in United States
television programs, a broad sample of programs from United States
television networks must be analyzed. In addition it would be necessary
either to draw randomly or vary systematically the number of programs
analyzed during different periods of the day. A thorough sampling of this
kind is a large-scale operation.
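One way to draw programs randomly within each period of the day is sketched below in Python. The function `sample_by_period`, the program titles, and the three periods are assumptions made for the illustration; an actual study would start from a complete listing of broadcasts.

```python
import random


def sample_by_period(programs, per_period, seed=0):
    """Draw up to `per_period` programs at random from each period of the day.

    `programs` is a list of (title, period) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    by_period = {}
    for title, period in programs:
        by_period.setdefault(period, []).append(title)
    sample = {}
    for period, titles in by_period.items():
        k = min(per_period, len(titles))
        sample[period] = rng.sample(titles, k)
    return sample


# Hypothetical listing: 30 programs spread evenly over three periods.
periods = ["morning", "afternoon", "evening"] * 10
programs = [(f"program {i}", period) for i, period in enumerate(periods)]
chosen = sample_by_period(programs, per_period=3)
```

Drawing the same number of programs from every period is one way to "vary systematically" the number analyzed; unequal numbers per period could be used instead if some periods carry more broadcasting.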

After messages are chosen for a content analysis, judges, usually
referred to as coders, must be selected and trained for the work. The kind
of coders chosen depends on the purpose of the study. For example, if the
analysis concerns how housewives react to romance magazines, coders
would necessarily be drawn from a representative cross section of
“homemakers.” If the work of the coders is concerned with analyzing
the message rather than reacting to the message, the coders should
be sophisticated persons who have some training in ... studies.
A third step in the organization of a content analysis is the training of
coders. This may be a relatively easy job if a simple count is being made
of readily observed content characteristics. If coders must make complex
judgments about the characteristics of content, a considerable training
period may be required. For example, in a recent study in which the
author participated (16), coders were required to judge the presence or
absence of 10 factors concerning beliefs about mental health as they are
portrayed in radio, television, newspapers, and magazines. It was
necessary to use psychology majors as coders, and considerable training
was necessary to explain the meanings and implications of the content
factors.

One essential step before the content analysis is undertaken is to test
the reliability of coders. This can be done quite simply by comparing the
judgments or counts made by several coders on the same material. If the
coders agree on, say, only 25 per cent of their judgments, either the study 
should be abandoned or additional training should be given the coders. 
Unless the judgments or counts are so simple as to be agreed upon almost 
perfectly by all coders, it is wise to use either two or several coders for 
each message that is analyzed. As is usually the case in the averaging of 
ratings or judgments, a much more reliable set of results will be obtained 
by the use of more than one coder. 
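The comparison of coders described above can be sketched as a simple percentage of agreement. The Python below is a minimal illustration under stated assumptions: the function name `percent_agreement` and the two sets of judgments are hypothetical, and the index makes no correction for agreement that would occur by chance.

```python
def percent_agreement(codes_a, codes_b):
    """Per cent of units on which two coders gave the same judgment."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must judge the same units")
    if not codes_a:
        raise ValueError("no judgments to compare")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100 * matches / len(codes_a)


# Hypothetical judgments by two coders on the same four messages.
coder_1 = ["helpful", "neutral", "harmful", "neutral"]
coder_2 = ["helpful", "neutral", "neutral", "neutral"]
agreement = percent_agreement(coder_1, coder_2)
print(agreement)  # 75.0
```

A figure as low as the 25 per cent mentioned above would show up directly in this index and would signal that more training is needed before the analysis proceeds.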

The major snag in the use of content analysis comes when the question 
being studied is complicated to the point where simple counts or judg- 
ments cannot be used. To illustrate this point, consider the following 
excerpt from the analyses of Wolfenstein and Leites (29) of the charac- 
ters and themes in American motion pictures: 


It is probably a characteristic American conviction that suffering is pointless 
and unnecessary. In keeping with this, love in American films rarely involves 
suffering. . .. Dangers in American films tend to take the form of external 
violence, not inner conflict. (29, pp. 94, 99) 


Insightful and interesting as are such analyses, it is difficult to see how 
they could be substantiated by measurements or counts of message 
characteristics.

Content analysis is essentially what is intended in many of the instru- 
ments of clinical psychology. Projective tests, particularly the TAT, are 
concerned with message content and the inferences which it provides 
about the individual’s personality. Here, as in numerous other content 
analyses, the questions under consideration are complicated beyond the 
use of simple counts and judgments about content characteristics. New 
and more powerful methods of content analysis are badly needed in many 
areas of research; but for some time to come it can be expected that 
many issues about message content will be studied by an intuitive 
approach rather than with explicit, standardized techniques.

CHAPTER 18


Establishment of Testing Programs 


So far in this book, psychological measurement has been discussed in 
the abstract, without considering the real-life world in which measure- 
ment methods must be applied. The best efforts of the psychological lab- 
oratory can be thwarted by the unknowing or unsympathetic test user, 
and the instruments which are founded on the best of theory may be 
impractical and too expensive for general use. The person who is well 
grounded in the principles of test construction may lack the personality 
and the know-how needed to educate the test consumer, establish con- 
genial relations with administrative officials, and organize a testing pro- 
gram. The purpose of this chapter is to discuss some guideposts for 
applying psychological measurement methods.

The Measurement Specialist in an Applied Setting. Large-scale testing 
programs are employed most often in vocational guidance, psychological 
counseling, school programs, and personnel selection. Special problems 
encountered in each of these settings will be discussed in turn. Regard- 
less of where the measurement specialist applies his skills, there are some 
general principles that he should follow in order to establish friendly 
relations. 

It is not uncommon for the measurement specialist to meet with consid- 
erable suspicion in the applied setting. The plant manager, for example, 
might regard the introduction of a testing program as a challenge to his 
authority to select and supervise personnel. The workers may fear that 
tests will be used to deprive them of their jobs or to shift them to less 
desired tasks. School teachers often react negatively to disruptions of 
schedule by psychological testing. The measurement specialist applies 
esoteric procedures like multiple correlation, item analysis, and factor 
analysis. These procedures tend to create difficulties in communicating
research findings, and administrators often feel insecure in the face of
statistical mumbo jumbo.

The first requirement for getting along well in an applied setting is to
communicate frankly about the methods and purposes of testing. It should
be made clear that the measurement specialist intends to fit in with ...

... cannot be constructed and put in use before more dependable assessments
of job performance are obtained. 

A second point of confusion involved between predictor and assess- 
ment tests is the assumption that the measurement specialist can, com- 
pletely on his own, manufacture job or classroom performance assess- 
ments. Using the example above, it would not be unexpected to receive a 
request for tests to measure the efficiency of taxicab drivers who are 
presently on the job. It must be explained that assessments concern 
human values, and although the measurement specialist can help imple- 
ment the values, the values themselves must come from those who have 
the responsibility to make decisions. 

Above all, the measurement specialist must steer clear of factionalism 
and dispute within an organization. A testing program can be used puni- 
tively to try to show, for example, that one administrator is more efficient 
than another. The measurement specialist should retain his standing as a 
scientist and gather facts in an impartial manner; otherwise he will lower 
his value as a research worker and his results will be trusted by few. 

Vocational Guidance. Vocational guidance may be incorporated in a 
school program, a psychological clinic, or in a separate government or 
commercial organization. People usually seek vocational guidance on 
their own or at the suggestion of an employer or teacher. The younger 
people come because they are unfamiliar with job opportunities and 
because they are not sure of their own aptitudes. Older people come 
because they are dissatisfied with their present positions or because their 
physical disabilities necessitate a change of jobs.

A wide variety of tests is needed in a vocational guidance setting. It is
particularly necessary to measure different aptitudes rather than to meas- 
ure general intelligence only. Two persons with the same over-all level of 
ability may have very different strong and weak points in particular apti- 
tude factors. Tests must also be available for the "special abilities," such 
as the mechanical aptitudes. Interest tests are also widely used in voca- 
tional guidance, and they apparently serve their purpose well. Because of 
the variety of tests which must be employed with most subjects, it requires 
as much as one or two whole days to complete the testing and subsequent 
discussion of results. A thorough vocational guidance program is rather 
expensive to maintain.

In addition to a thorough knowledge of psychological tests, the
vocational counselor must have a wide acquaintance with job requirements
and employment opportunities. It would, for example, be incorrect to
suggest a particular vocation which would utilize an individual's
interests and abilities if there is a severe shortage of jobs in that line.

The counselor does not tell the individual what job or vocation he
should enter. Tests are not sufficiently exact or all-inclusive to be able to
make a decision for the person. Often extraneous factors outweigh the test
results. A person may appear best suited for a career as a painter or actor; 
but, if he has several dependents, the immediate need for a steady income 
may necessitate the choice of another vocation. The needs of a person’s
family, the values of his subgroup, and the desire to live in a certain
locality will all affect decisions about jobs and professional training.
The purpose of vocational guidance is to furnish the individual with
information about himself and about occupations which, when added to what
he already thinks and feels, will help him arrive at wise decisions.

Psychological Counseling. Here we will consider the use of tests to
aid in making decisions about maladjusted persons. The setting may be a
psychological or psychiatric clinic or a mental hospital. The purpose is to
diagnose the source of maladjustment and to decide the kinds of
treatments that will do the most good. Whereas most of the instruments used
in vocational guidance are group tests, individual tests predominate in
clinical settings. Individual tests are used because the nature of the per- ...

... low scores on a test like the Stanford-Binet for very different reasons. One
child may try hard, in a slow, strained manner. Another child might give
wrong answers with a sly look in his eyes and intersperse his wrong
answers with highly intelligent remarks about the testing situation. Being
able to note differences of this kind is very helpful in detecting sources of ...

The range of tests employed in a clinical setting is usually not as broad
as that required in vocational guidance. Reliance is placed principally on
measures of general ability, particularly on the Stanford-Binet and the 
Wechsler scales, and on the Rorschach and TAT as measures of person- 
ality. Although there is occasional use of personality inventories, such as 
the Bernreuter and the MMPI, these are less favored than the projective 
instruments. 

Psychological counselors differ in the extent to which they employ psy- 
chological tests. Some counselors believe that the use of tests starts treat- 
ment off on the wrong foot. They feel that tests encourage the individual 
to think that the counselor is going to give him the “answers” to all his 
problems. Counselors with this point of view prefer to foster the
individual's independence of thought and to work with him in solving his own
problems, rather than advise, manage, and direct. However, most counselors
feel that tests are very useful in conjunction with treatment. Some coun- 
selors would prefer not to give and interpret the tests themselves, because 
they think that it will foster an unhealthy dependence. They would prefer 
to have the tests administered by some other psychologist and, to the 
extent that it is advisable, have the examiner explain the results to the 
individual. 

School Programs. The emphasis in testing programs changes from ele- 
mentary to advanced education. In elementary and high school, the test- 
ing program is usually tied directly to classroom instruction and the 
management of students. The measurement specialist is usually referred to 
as a school psychologist and usually has extensive training in educational 
methods. He helps compose examinations for particular courses, selects 
achievement tests, and performs diagnostic testing for those children who 
are meeting with difficulties. It is becoming popular to give all children 
a standard battery of tests, usually consisting of a group intelligence 
measure, interest inventories for older children, and sometimes attitude
and personality inventories.

Contrary to the procedure used in some other countries, there are no 
formal selection procedures in the United States to decide who can and 
who cannot enter grammar school and high school. Anyone can go on to 
a higher grade if he passes the next lower one. Consequently, tests are 
very seldom used for selection in elementary and high school. Tests are 
becoming increasingly popular in educational research, to find better 
methods of instruction and better ways of organizing curricula. 

In the late years of high school and in college, one of the major func- 
tions of a testing program is to aid vocational guidance, which was dis- 
cussed in a previous section. Selection is one of the primary jobs in many 
college and university testing programs. Many private colleges and some 
state-supported institutions admit only those applicants who show
promise of doing well scholastically. In conjunction with high school grades,
psychological tests are the major selection instruments. Tests for
selecting college freshmen are used very widely and they serve the purpose
well. The predictive efficiency of the tests can be measured
straightforwardly by correlating scores with grade point averages obtained
later. Because the number of college applicants is steadily increasing
beyond the facilities of educational institutions, it is expected that
psychological tests will come even more prominently into play as selection
instruments.
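Correlating test scores with grade point averages obtained later can be sketched as a product-moment correlation. The Python below is a minimal illustration: the function name `pearson_r` and the scores and averages for six students are hypothetical, not results from any actual selection program.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical entrance-test scores and later grade point averages.
test_scores = [52, 61, 47, 70, 58, 66]
grade_point_averages = [2.4, 3.0, 2.1, 3.5, 2.6, 3.1]
validity = pearson_r(test_scores, grade_point_averages)
print(round(validity, 2))
```

The resulting coefficient is the validity of the test for predicting grades in that group of students; a different group or a different criterion could give a different value.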


Personnel Selection and Management. Tests are being used widely in 
military, civil service, and industrial work to decide which persons will 
be given particular jobs or positions and to organize working conditions 
for maximum efficiency, safety, and pleasantness. Specialists who
concern themselves with these matters are usually referred to as personnel ...

TABLE 18-1. RELATIONSHIP BETWEEN INTELLIGENCE TEST SCORES OF CLERICAL
WORKERS AT TIME THEY WERE HIRED AND JOB LEVEL THEY ACHIEVED LATER

(Adapted from Pond and Bills, 9)

Intelligence test      Low job,    Middle job,   High job,
score                  per cent    per cent      per cent
180 and above              4           42            54
160-179                    9           71            20
140-159                   33           59             8
139 and below             87           13             0


test remained in a low-level job, only 4 per cent of the persons with 
highest test scores failed to move up to higher positions. 
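A table of this kind can be built from individual records by counting, within each score band, how many persons reached each job level and converting the counts to per cents of the band. The Python sketch below assumes hypothetical names (`expectancy_table`, the band and level labels) and made-up records; it does not reproduce the data of Table 18-1.

```python
from collections import Counter, defaultdict


def expectancy_table(pairs, levels):
    """Per cent of persons in each score band who reached each job level.

    `pairs` is a list of (score_band, job_level) records.
    """
    counts = defaultdict(Counter)
    for band, level in pairs:
        counts[band][level] += 1
    table = {}
    for band, level_counts in counts.items():
        total = sum(level_counts.values())
        table[band] = {
            level: round(100 * level_counts[level] / total) for level in levels
        }
    return table


# Hypothetical records: (score band when hired, job level achieved later).
records = (
    [("high band", "high job")] * 3
    + [("high band", "low job")] * 1
    + [("low band", "low job")] * 4
)
levels = ["low job", "middle job", "high job"]
table = expectancy_table(records, levels)
```

Each row of the result sums to about 100 per cent, which is what allows rows based on different numbers of persons to be compared.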

One of the major variables to consider in selecting personnel for some 
jobs is labor turnover. That is, the job may be easy to learn and most per- 
sons may perform well enough, but many employees leave the job after a 
short time. Because it is expensive to “break in" new employees and 
because a high turnover disrupts operations, it is important to select 
persons who are likely to remain on the job for a relatively long time. 
Tests can be used to predict the likelihood that an individual will remain 
on a job. Table 18-2 shows the relationship between intelligence test 


TABLE 18-2. RELATION OF INTELLIGENCE TEST SCORES TO LENGTH OF SERVICE
FOR A GROUP OF CASHIERS AND INSPECTOR-WRAPPERS

(Adapted from Viteles, 10)

Test score          Average length of
                    service, days
90 and above              35
80-89                     87
70-79                     96
60-69                    100
50-59                    107
40-49                    142
30-39                     91
20-29                     91
10-19                      3


scores and length of service in one job setting. In that instance, people in
the middle intelligence range are more likely to remain on the job for a
long period. It is likely that interest, attitude, and personality tests will
also be useful in predicting job turnover.

In selecting individuals for some jobs, the major concern is how safely
the individual will perform. This is particularly so when the individu- ...

(Adapted ...)

Type of test                 Validity coefficient
Dotting                          .35
Tapping                          .47
Judgment of distance             .18
Distance discrimination          .20
Mechanical principles            2
Arithmetic                      -.09
Speed of reaction               -.04
Interest                         .28

... validity of various tests as predictors of accidents by taxicab drivers. In
that instance the two motor skills of dotting and tapping (both
concerned with the rapid and precise use of finger and arm muscles) are the
best predictors. The most predictive tests tend to be different for different
jobs, and in only a few cases have tests been successful in predicting who
will have accidents.

... evaluations are best obtained from some of the research reports which
will be discussed in the following sections. 

In addition to the test catalogues, a number of test bibliographies are 
available. Hildreth (7, 8) published a list of 5,294 tests, covering the 
major instruments that had been developed up to 1945. The tests are 
listed by topic, in order that a person who is interested in, say, manual 
dexterity tests can most easily find the instruments that are available. 
No effort is made to present research findings in respect to the tests or to 
evaluate them in any way. References as to where the test is published 
or where it is described in some detail are given for each test. Other lists 
have been prepared for special testing areas. 

Administration and Use of Tests. After one or several tests are selected 
as likely candidates for use in a particular setting, it is necessary to learn 
how the test is administered and scored. This information is usually sup- 
plied by test manuals, which can be purchased along with or separately 
from particular tests. Manuals usually describe how the test was con- 
structed, provide some norms, and in many cases report research findings 
on reliability and validity. The amount of research evidence given in the 
manual and the nature of the findings will often provide helpful sugges- 
tions as to whether or not a test should be tried in a particular situation. 

A particularly useful publication for all those who deal with psycho- 
logical tests is “Technical Recommendations for Psychological Tests and 
Diagnostic Techniques” prepared by the Committee on Test Standards of 
the American Psychological Association.¹ It provides numerous
suggestions for the development, validation, commercial distribution, and use of
psychological tests. 

Research Evidence. It is important before accepting a test to examine 
the research evidence which has accrued from applications to different 
measurement problems. Although a test may look quite promising when 
it is published, it takes several years or more to judge how well it works 
in practice. One source of research findings is the Journal of Applied
Psychology. You would search there if you were interested, for example, in
the ability of the Strong Interest Inventory to select salesmen. The
Journal of Consulting Psychology has reviews of new tests and research
reports on older tests, primarily those instruments which are used most
widely in clinical psychology. Summaries of research in whole areas of
testing, such as the application of personality inventories in military
psychology or the use of tests to measure brain damage, appear occasionally
in the Psychological Bulletin.

The most detailed sources of research information about tests are the ...

¹ The publication appeared as a supplement to the Psychological Bulletin,
1954, 51, Part 2. A copy can be obtained by mail for $1 from the American
Psychological Association, 1333 Sixteenth Street, N.W., Washington 6, D.C.

When tests have finally been selected for use in a measurement
program, there are numerous practical considerations which will govern the
efficiency and economy of their use. Many of the practical problems are
specific to particular testing situations; however, there are some rules
and procedures that will generally help in maintaining a smoothly
operating measurement program.

Setting. A primary consideration is to obtain an adequate testing
room. It should be well lighted, free from noise and other distractions,
and provided with the necessary number of desks or tables. If possible, it
is best to have a particular room permanently assigned for testing, so that
it can be equipped for the testing program. Special equipment, such as
wiring for electrical devices, cabinets for storing test material, movie
projectors, public address systems, and others, may be required.

Administration and Scoring. In most cases the examiner explains to the
subjects why particular tests are being administered and how the forms
are to be completed. The instructions should be sufficiently detailed to
cover most of the problems that will come up. For example, if a particular
pencil is used with the test, the subjects should be informed that
more pencils are available if a point breaks. If the examinees are allowed
to ask questions during the test, they should be told the kinds of ques- 
tions that will be permitted and also how to summon the examiner. When 
a large number of persons are being tested at one time, it is usually better 
to inform them that it will not be possible to answer questions once the 
testing is under way. Subjects must be told what to do when they com- 
plete the test or tests: go back and work on the problems some more, leave 
the room, or remain seated without looking back at the materials. 

It is usually necessary to employ several assistants when testing a large 
group of persons. They can answer questions if questions are permitted. 
They can help direct people to their proper work spaces, give out
materials, and collect them when the testing is over. Assistants are also helpful
in preventing cheating. They need have little knowledge about the nature
and purpose of the tests being used.

It is very helpful to "role play" the administration procedure before
testing groups of subjects. This consists of going through all of the
administration routine with a few people acting as subjects. They are asked to
note any point at which the testing procedure is unclear. One subject may
notice that at the bottom of the first page there is no information to
indicate whether to go on to the next page or stop there. Another subject may
not know whether or not he is to guess when unsure of the answers. A
trial run of this kind will offer many suggestions for clarifying the
administration routine.

Analysis of Test Data. High-speed computational machines are rapidly 
replacing the relatively slow human hand and human mind. Special 
answer sheets are now widely used for testing large groups of subjects. 
The answer sheets can be scored either with a stencil or, much faster, by
machine (International Business Machines Corporation is one of the
best-known distributors of such equipment). Hundreds of answer sheets can
be scored in an hour’s time. 

It facilitates the subsequent analysis of test results if scores are recorded 
on punch cards. One card can store a considerable amount of information 
about a person: scores on a number of tests; assessment scores, such as 
job ratings and production records; and personal data, such as years of 
experience and marital status. High-speed computational equipment can 
then be used to find particular kinds of persons, such as those who score 
above a certain level on a test, those who have three years of experience, 
and those who are unmarried. Also, it is even more important that 
machine computers can make relatively fast work of the complex statis- 
tics which are often required in the construction and use of tests. A few 
years ago it would have been prohibitively laborious to undertake a 
factor analysis of 100 test items. Now it requires only several hours of 
digital computer time to complete the job. 
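The kind of sorting described above, in which each card holds one person's scores and personal data, can be sketched with a list of records. The Python below is an illustration only: the field names, the four cards, and the function `select_persons` are hypothetical, and the three conditions from the text are combined into a single search although they could equally be run separately.

```python
# Each record stands for one card; all values are made up for illustration.
cards = [
    {"card": "001", "test_score": 82, "years_experience": 3, "married": False},
    {"card": "002", "test_score": 91, "years_experience": 5, "married": False},
    {"card": "003", "test_score": 64, "years_experience": 3, "married": False},
    {"card": "004", "test_score": 77, "years_experience": 3, "married": True},
]


def select_persons(cards, min_score, years_experience, married):
    """Return the card numbers of persons meeting all three conditions."""
    return [
        c["card"]
        for c in cards
        if c["test_score"] > min_score
        and c["years_experience"] == years_experience
        and c["married"] == married
    ]


selected = select_persons(cards, min_score=70, years_experience=3, married=False)
print(selected)  # ['001']
```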
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THE USE OF HUMAN SUBJECTS 
The life sciences (including the biological and social sciences) and 
physical science have many points in common. Both search for lawfulness 
in nature, value objectivity above opinion and desire, and employ many 
of the same conceptual and mathematical tools. They differ mainly in that 
the life sciences deal with people whereas physical science is mostly 
concerned with inanimate objects or subhuman animals. Thus, the
working routine of the psychologist is vastly more difficult than that of the
physical scientist for two reasons. First, human behavior obeys no simple
laws, and consequently, progress in psychology is often slow. Second, the
psychologist must consider the feelings, needs, and safety of his human
subjects. The physicist who is interested in the structure of crystals can
pound them, burn them, freeze them, and work without considering their
"feelings." The psychologist must be very careful not to harm or, in many
cases, even discomfort his human subjects. This section will discuss some
of the ground rules for dealing with human subjects in psychological
experiments and testing programs.

The Use of Disturbing Materials. It is often necessary to frighten,
embarrass, or disturb people in order to carry out an experiment or test
a particular attribute. For example, if you want to learn the effect of
anxiety on memory, it is necessary to create human anxiety to perform
the experiment. This might be done by an electric shock, a series of
embarrassing questions, or a film of a surgical operation. In some studies
it is necessary to hypnotize a subject, isolate an individual in a room
for several hours, or have someone act out the part of a ... The
first rule in dealing with material that is likely to create strong feelings
is that all subjects should be informed of the nature of the materials before
the experiment. They need not necessarily be told what the experimenter
hopes to find or all of the conditions of the study, but they should at least
be told about the source of disturbance which will be employed.
Secondly, the subjects should all volunteer without undue pressure from
the examiner, and they should be allowed to discontinue the experiment ...

One of the primary difficulties in performing experiments
with people is to obtain enough subjects. Whereas the
physical scientist can obtain large quantities of the material he
needs, with only a financial problem, it is often difficult to
find people and persuade them to participate in studies. They ...
like most people, they are busy ... Consequently, many
studies are carried out with students because they are the
most available subjects. There is a growing awareness that many research
findings are different when the subjects are recruited from the
community at large rather than from students only.

It is generally bad practice to rely on volunteers for developing tests
and for research studies, although, as was said in the previous section,
volunteers should be used when disturbing materials are part of the study.
The people who volunteer are usually more knowledgeable about the
research area, women tend to volunteer more often than men, and it is
likely that the personalities of volunteers and nonvolunteers are
different. In some cases it is possible and not unethical to assign whole
groups of individuals to an experiment. This can be done in some school,
industrial, and military situations. When subjects are not obligated to
participate in a study, as would be the case in studying the attitudes of a
particular church congregation, the experimenter should try to enlist as
many of the people as possible. There will always be some people who
either cannot or will not participate, but if as many as 80 or 90 per cent
are obtained, the results will probably not be strongly influenced by
"volunteer bias."

Giving Information. People are sufficiently curious to want to know 
what a research study is about; and when psychological tests are used, 
they will want to know how well they perform. It must be decided prior 
to a study just how much information can be given to subjects. The sub- 
jects should be informed before the experiment what they can be told 
later about the results. 


One place in which decisions must be reached about providing informa- 
tion to subjects is in the routine administration of a testing program. 
There are certain testing situations in which the subject is usually in- 
formed in some detail about the test results, as would be the case in voca- 
tional guidance. At the other extreme, the subject is usually not told the 
results of personality tests given in a clinical setting. The information 
might disturb him and raise more questions than are answered. In 
between these extremes there are situations in which it is difficult to 
decide how much information should be supplied to the subject and to 
other people.

Subjects generally should not be told the results of tryout tests
whose worth has not yet been established. The experimenter may think
that he is employing a measure of neuroticism, but subsequent
studies show that it does not measure what is intended. It would be an
injustice to inform a subject that he has a high neurotic tendency on the
basis of the test. In general, the less known about the validity of an
instrument, the fewer are the grounds for informing subjects of test
results.

... ing, nail biting, gritting teeth, and other forms of “nervousness.” The
tension is even higher when the graded examinations are returned.

The person who is new in the field of human measurement may be sur- 
prised at the amount of hostility sometimes shown by subjects after a 
research study or after taking tests. For example, after administering a 
tryout personality test, a number of the subjects are likely to come up 
afterwards and say that the test is no good (they hope) or that the study 
will show nothing. This is often a way of blowing off steam because of the 
tension brought on by the test, and it is part of the experimenter's job to 
accept the hostility and even encourage its free expression. A special 
follow-up meeting is often used, called a “cathartic session,” to allow 
people to vent their feelings about a study. This allows a give-and-take 
discussion between experimenter and subjects, lets the subjects feel 
that they are more than "guinea pigs," and usually reduces tension 
considerably. 

The importance of human measurement far outweighs the tension and 
embarrassment that is sometimes generated. We may all be created equal 
in the social sense but we are not all equal in terms of abilities and per- 
sonality characteristics. It would be a great social waste not to help the 
talented achieve as much as they can, and a great social harm to encour- 
age unsuited individuals to labor in ventures that will bring failure and 
discouragement. Taking a frank look at our own abilities and personality 
characteristics is not always pleasant at the moment, but it is necessary 
for long-range achievement and happiness. 
APPENDIX 2 


Scale Transformations 


A problem that is often encountered in using psychological tests is that of 
transforming an obtained set of raw scores to a set with a particular mean and 
standard deviation. For example, it might be found that the mean of the 
obtained raw test scores is 40 and that the standard deviation is 5. In order to 
compare scores on the test with scores on another test, or in order to have the 
scores in an easily interpretable form, it might be desired to transform the raw 
scores in such a way that the new scores have a mean of 50 and a standard 
deviation of 10. Transformations of this kind can be performed with the 
following formula: 


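$$X_t = \frac{\sigma_t}{\sigma_o} X_o - \frac{\sigma_t}{\sigma_o} M_o + M_t$$


where $X_t$ = scores on the transformed scale 
$X_o$ = scores on the obtained scale: raw scores 
$M_t$, $M_o$ = means of $X_t$ and $X_o$ respectively 
$\sigma_t$, $\sigma_o$ = standard deviations of $X_t$ and $X_o$ respectively 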

The formula can be applied to the problem illustrated above as follows: 


$$X_t = \frac{10}{5} X_o - \frac{10}{5}(40) + 50 = 2X_o - 30$$


By this transformation a raw score of 40 would be transformed to a score of 50, 
and a raw score of 25 would be transformed to a score of 20. Because the 
formula is a linear transformation, it does not change the shape of the score 
distribution. 
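
The transformation can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name `transform_score` and its parameter names are hypothetical and do not come from the text.

```python
def transform_score(raw, old_mean, old_sd, new_mean, new_sd):
    """Linearly rescale one raw score to a scale with a chosen mean and SD.

    Implements: new = (new_sd / old_sd) * raw - (new_sd / old_sd) * old_mean + new_mean
    """
    if old_sd <= 0:
        raise ValueError("old_sd must be positive")
    slope = new_sd / old_sd
    return slope * raw - slope * old_mean + new_mean


# Illustrative problem from the text: raw mean 40, raw SD 5,
# desired mean 50, desired SD 10.
for raw in (40, 25):
    print(raw, transform_score(raw, old_mean=40, old_sd=5, new_mean=50, new_sd=10))
```

Because the slope and intercept are the same for every score, the rank order and the shape of the distribution are unchanged; only the unit and the origin of the scale differ.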
of the Normal Distribution * 
Derivation of Measurement Error Formulas 


Currently there are two mathematical models from which measurement error 
statistics can be derived. Although the same statistics are developed from both 
models, the models make different assumptions about the data. The first model 
to be examined considers any particular test as one from a universe of possible 
tests. The following assumptions are made in the model: 

1. All of the sample tests in the universe of tests have the same variance. 

2. All of the sample tests correlate the same with one another. That is, the 
correlation of test 1 with test 2 is the same as the correlation of test 1 with test 3, 
and the same as the correlation of test 2 with test 3, and so on. 

3. All of the tests correlate the same with some outside variable (a criterion 
to be predicted, a test from another universe of content, or any other variable). 

If the above assumptions hold, the correlation of the sum, or average, of all 
the sample tests in the universe (which is usually spoken of as the set of "true" 
scores) with the criterion variable is obtained as follows: 


$$r_{y(1 \ldots n)} = \frac{\sum z_y (z_1 + z_2 + \cdots + z_n)}{\sqrt{\sum z_y^2}\,\sqrt{\sum (z_1 + z_2 + \cdots + z_n)^2}}$$


where $r_{y(1 \ldots n)}$ = the correlation of the criterion variable, y, with the sum of the 
standard scores on all sample tests in the universe 
$z_1$ = standard scores on sample test 1 
$z_n$ = standard scores on test n 


The equation above can be expanded as follows: 


$$r_{y(1 \ldots n)} = \frac{r_{y1} + r_{y2} + \cdots + r_{yn}}{\sqrt{\sum_i \sum_j r_{ij}}}$$


where the denominator is the square root of the sum of all the coefficients in 
the square matrix of intercorrelations of sample tests. 


Because of assumptions 2 and 3 in the model: 
What the formula says is that, if the assumptions of the model hold, the 
correlation of a set of test scores with the sum of all of the scores from the tests in the 
universe equals the square root of the correlation between any two sample tests. 
Usually the two tests which are correlated are referred to as equivalent, or 
comparable, forms. The correlation of any two sample tests, $r_{ij}$, is usually 
symbolized as $r_{11}$ and is called the reliability coefficient. Because the square root of 
the reliability coefficient equals the correlation of either of the sample tests with 
"true" scores, $r_{11}$ is said to equal the per cent of non-error variance. 

The model also allows the derivation of the standard error of measurement 
and the other statistics which are needed in the study of measurement error. Of 
course the conditions of the model will not hold exactly in any real case. 
Sample tests (equivalent forms) will have different variances, they will 
correlate differently with any outside variable, and their intercorrelations will not all 
be equal. Consequently, the reliability coefficient is only estimated by the 
correlation of two particular sample tests, or equivalent forms. The advantage of the 
model is that it shows what assumptions must be met in determining 
measurement error statistics, and it provides an indication of the way in which 
departures from the model can be studied. 

There are several disadvantages of the above model for determining measure- 
ment error statistics. One difficulty is that the model is based on a sampling 
theory, which considers any particular test as one from a universe of possible 
tests. It is difficult to define universes of test content and difficult to ensure that 
all tests are from the same universe of content. Another disadvantage of the 
model is that it requires an infinite number of sample tests, and in consequence, 
the infinity sign appears in the derivation of measurement error statistics. Infinity 
is, of course, only a handy fiction when discussing empirical data. However, the 
model should hold with good accuracy if a large number of sample tests, say 
100, is studied. Although studies of this kind have not yet been undertaken, it 
would be possible to develop 100 similar tests, such as 100 short spelling, 
vocabulary, or arithmetic tests and study the extent to which the model fits 
empirical data. 

The second model which is used for the derivation of measurement error 
statistics begins by defining an obtained score as the sum of a “true,” or uni- 
verse score, and an error score: 


$$x = x_t + e$$

where $x$ is the score obtained by an individual on a particular test 
$x_t$ is the individual's true score 
$e$ is the error or chance portion of the obtained score 

The first assumption in the model is that the error scores are uncorrelated with 
true scores. Therefore 

$$\sigma_x^2 = \sigma_t^2 + \sigma_e^2$$

If there are two measures of the same trait, the true score is expected to be the 
same in both measurements, but the error scores are not necessarily the same. 
The two obtained scores can then be broken down as follows: 
able forms. 

The reliability coefficient is defined as the correlation between two equiva- 
lent forms: 
Several assumptions are then made: 

1. As in the first model, it is assumed that the two equivalent forms have 
equal standard deviations, and, consequently, equal variances. 

2. It is assumed that error scores on each test are uncorrelated with true 
scores on both tests. 

3. It is assumed that error scores on one test are uncorrelated with error scores 
on the other test. 

If assumptions 2 and 3 hold, three of the terms in the numerator of the above 
expression vanish, and the correlation between equivalent forms reduces to the 
following: 


Because of the assumption that the standard deviations are equal on the two 
equivalent forms, the reliability coefficient may be expressed as 


$$r_{11} = \frac{\sigma_t^2}{\sigma_x^2}$$


What the formula says is that, if the assumptions in the model hold, the 
correlation of two equivalent forms equals the ratio of the "true" variance to the 
variance of either measurement or, regarding the ratio as a percentage, equals the 
per cent of non-error variance. This is the same conclusion that we obtained 
from the first model. 

The standard error of measurement can be obtained through the following 
developments: 


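$$\sigma_e^2 = \sigma_x^2 - \sigma_t^2$$


Both sides of the equation can then be divided by the variance of x: 


$$\frac{\sigma_e^2}{\sigma_x^2} = 1 - \frac{\sigma_t^2}{\sigma_x^2} = 1 - r_{11}$$

Multiplying by the variance of x and taking the square root gives the standard 
error of measurement: 

$$\sigma_e = \sigma_x \sqrt{1 - r_{11}}$$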


In the text the more explicit symbol, $\sigma_{meas}$, was used rather than $\sigma_e$. In a 
similar way the model can be used to derive the correction for attenuation and the 
other measurement error statistics. 
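
The standard error of measurement is simple to compute once the standard deviation of the obtained scores and the reliability coefficient are known. The following Python sketch illustrates the calculation; the function name and the numerical values (a standard deviation of 10 and a reliability of .84) are hypothetical and are chosen only for illustration.

```python
import math


def standard_error_of_measurement(sd_obtained, reliability):
    """Return sd_obtained * sqrt(1 - reliability).

    sd_obtained: standard deviation of the obtained scores
    reliability: reliability coefficient, expected to lie between 0 and 1
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_obtained * math.sqrt(1.0 - reliability)


# Hypothetical values, not taken from the text.
sem = standard_error_of_measurement(sd_obtained=10.0, reliability=0.84)
print(f"standard error of measurement: {sem:.2f}")
```

A perfectly reliable test gives a standard error of zero, and a test with no reliability gives a standard error equal to the standard deviation of the obtained scores.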
The advantage of the second model presented here is that it is not explicitly 
dependent on a sampling theory as is the case with the first model presented. 
However, this apparent advantage is not all to the good. The statistical results 
(the reliability coefficient, the standard error of measurement, etc.) are likely to 
be somewhat different if a test is compared with a number of different 
equivalent forms instead of only one. There is then some confusion as to the correct 
statistical results, and the model itself is unable to deal with the inconsistencies. 
One can always say that the reliability coefficient obtained from correlating any 
two equivalent forms is only an estimate, but the second model gives no hint as 
to what is being estimated nor any means for studying the accuracy with which 
estimates are made in general. Consequently, advocates of the second model 
have begun to think in terms of the relationships among collections of equivalent 
forms, all of which have the same variance and all of whose intercorrelations 
are equal. Therefore, a sampling theory is creeping into the second model; and, 
as the necessity for considering any test as one from a possible universe is 
brought into the model, the second model presented begins to collapse into the 
first model presented. 

There are other reasons why the first model presented is probably better than 
the second. In particular, some of the constructs in the second model cannot be 
measured directly. Neither "true scores" nor “error scores" can be measured 
directly and can only be inferred from correlations among equivalent forms. 
The first model presented does not require these nonoperational constructs. 
Another mark against the second model is that some of the assumptions are 
probably not true in particular testing situations. On some occasions it is likely 
that errors will be correlated with "true" scores and errors on one test will be 
correlated with errors on another test. The first model does not require either of 
these assumptions. 

It is comforting to know that two different, or apparently different, models 
arrive at the same formulas. This gives us some confidence in applying the 
standard measurement error statistics, such as the correction for attenuation and 
others. Future research should indicate which of the two models is more gen- 
erally useful in studying the effect of measurement error in psychological tests. 


APPENDIX 5 


The Influence of Test Length on Reliability 


If reliability is determined by the split-half method, the correlation between 
the split-halves is the reliability of half the test rather than the reliability of the 
whole test. Of course the whole test is employed in applied work, and the 
reliability of the whole test is larger than that of the half-tests. The reliability of the 
whole test can be estimated as follows: 


$$r_{11} = \frac{2 r_{hh}}{1 + r_{hh}}$$

where $r_{11}$ is the reliability of the whole test 
$r_{hh}$ is the correlation between the split-halves 

The formula is referred to as the "Spearman-Brown prophecy formula." The use 
of the formula can be illustrated with a correlation between split-halves of .80: 

$$r_{11} = \frac{2(.80)}{1 + .80} = \frac{1.60}{1.80} = .89$$

Thus the best estimate of the reliability of the whole test is .89. 
What the Spearman-Brown formula does is to estimate the reliability of a test 
twice as long (twice as many of the same kinds of items) as the test from which 
the reliability is actually computed. The formula can be extended to estimate 
the reliability of a test lengthened by any particular amount: 

$$r_{nn} = \frac{n r_{11}}{1 + (n - 1) r_{11}}$$

where $r_{nn}$ = the reliability of a test n times as long as the original test 
$r_{11}$ = the reliability of the original test 

If the reliability of a 20-item test is .70, and if 40 items very similar to the first 
20 items are added to the test, the reliability of the 60-item test is estimated 
as follows: 

$$r_{nn} = \frac{3(.70)}{1 + (3 - 1).70} = \frac{2.10}{2.40} = .88$$
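
The two worked examples can be checked with a short Python sketch. The function name `spearman_brown` and its parameter names are hypothetical; the arithmetic is the general Spearman-Brown formula, of which the split-half correction is the special case n = 2.

```python
def spearman_brown(reliability, n=2.0):
    """Estimate the reliability of a test n times as long as the original.

    reliability: reliability of the original test (or the correlation
                 between split-halves when n = 2)
    n:           factor by which the test length is multiplied
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (n * reliability) / (1.0 + (n - 1.0) * reliability)


# Split-half example: correlation of .80 between the halves.
print(f"whole test: {spearman_brown(0.80, n=2):.3f}")

# Lengthening example: a 20-item test with reliability .70 extended to 60 items.
print(f"60-item test: {spearman_brown(0.70, n=60 / 20):.3f}")
```

The second value works out to 2.10/2.40, or .875, which is reported as .88 when rounded to two places.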


Some cautions should be heeded in applying the formulas for estimating the 
reliability of a lengthened test. The formulas assume that when the test is 
lengthened the new items will be of the same kind as the original test items. 
If the new items relate to factors other than those in the original items or if the 
new items are generally easier or more difficult than the original items, the 
formulas will give misleading estimates. However, in spite of the assumptions 
which the formulas make, they provide a very useful, and usually a reasonably 
accurate, estimate of the gain in reliability that will come from lengthening a 
test. The formulas are also useful in estimating the amount of reduction in reli- 
ability that will follow from shortening a test. It may be desirable to use a 
shorter version of a particular test in order to reduce the time for administering 
test materials. For example, if the reliability of a 100-item test is known, and it 
is desired to use only 50 of the items, say, choosing only the odd-numbered 
items, the reliability of the 50-item test can be estimated by substituting 1/2 for n 
in the equation above and substituting the reliability of the 100-item test for $r_{11}$. 
Sampling Error of Percentages 


In opinion polling or in any other use of samples of persons, the results 
obtained are only estimates of the true population values. Sampling error was 
illustrated in Chapter 13 by considering the results of polling 1,000 persons, in 
which it is found that 55 per cent of the people say that they will vote for 
candidate A and the remaining 45 per cent say they will vote for candidate B. 
Before it can be safely said that candidate A has an edge over candidate B, it 
must be ensured that the apparent difference in voting sentiment is not likely to 
be due to sampling error only. If more than 50 per cent of the population 
actually vote for A, he will be elected. Consequently, the question is whether the 
sample value of 55 per cent who say they will vote for A is significantly greater 
than 50 per cent. This can be determined by the standard error of the 
percentage, which is obtained as follows: 


$$\sigma_{\%} = \sqrt{\frac{PQ}{N}}$$

where P = the hypothetical population percentage 
Q = 100 - P 
N = the number of persons in the sample 


The formula is applied to the illustrative problem as follows: 


$$\sigma_{\%} = \sqrt{\frac{50 \times 50}{1{,}000}} = \sqrt{2.50} = 1.58$$


The difference between the sample percentage and the hypothetical population 
percentage is then divided by the standard error of the percentage, as follows: 

$$\frac{55 - 50}{1.58} = \frac{5}{1.58} = 3.16$$
The resulting value is a standard score, and its significance can be found in 
Appendix 3. There it is seen that a standard score of 3.20 is nearest to the 
obtained value of 3.16. Only .00138 of the cases lie above and below 3.20 stand- 
ard scores. Here we are interested only in the proportion of people in one tail 
of the distribution. Our only concern is whether the population per cent is 
greater than 50 per cent. If it is, it makes no difference whether the population 
per cent is actually larger than the sample value of 55 per cent. Therefore, the 
value found in Appendix 3 should be divided by 2, which gives .00069. In 
other words, with a population per cent of 50 and with samples of 1,000 each, 
less than one time in a thousand would we expect to find sample values of 
55 per cent or greater. 
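
The same test can be sketched in Python. Instead of looking up the nearest tabled value, the sketch computes the one-tailed normal area directly with the complementary error function, so its probability differs slightly from the table-based figure above, which used the nearest tabled standard score of 3.20. The function names are hypothetical.

```python
import math


def standard_error_of_percentage(population_pct, n):
    """Standard error of a percentage: sqrt(P * Q / N), with Q = 100 - P."""
    q = 100.0 - population_pct
    return math.sqrt(population_pct * q / n)


def one_tailed_percentage_test(sample_pct, population_pct, n):
    """Return (standard score, upper-tail probability) for a sample percentage."""
    se = standard_error_of_percentage(population_pct, n)
    z = (sample_pct - population_pct) / se
    # Area of the normal curve above z: 0.5 * erfc(z / sqrt(2)).
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, upper_tail


# Illustrative problem: 55 per cent in a sample of 1,000, tested against 50.
z, p = one_tailed_percentage_test(sample_pct=55.0, population_pct=50.0, n=1000)
print(f"standard score = {z:.2f}, one-tailed probability = {p:.5f}")
```

Either way the conclusion is the same: a sample value of 55 per cent or more would be expected less than one time in a thousand if the population value were 50 per cent.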

Another problem in which sampling error formulas are needed is in deter- 
mining the significance of the difference between the results obtained from two 
independent samples. For example, the results of one poll might show that 
55 per cent of the people intend to vote for candidate A, and a later poll, using 
a different sample of persons, shows that 60 per cent of the people intend to 
vote for A. It is tempting to interpret the results as showing an increased popu- 
larity of candidate A. However, before this can be done, it is necessary to test 
the significance of the difference between the two sample results. Formulas for 
the significance of the difference between two percentages and for more 
complex sampling error problems can be found in standard statistical texts.¹ 


¹ See A. L. Edwards, Experimental Design in Psychological Research, New York: 
Rinehart, 1950; Q. McNemar, Psychological Statistics, 2d ed., New York: Wiley, 
1955; Helen M. Walker and J. Lev, Statistical Inference, New York: Holt, 1953. 
Computation of Correlations among Q-sorts 


One of the handy features of the Q-sort rating scheme is that it permits the 
computation of correlation coefficients with a minimum of labor. Because the 
same distribution is used for all sorts in a particular study, the following formula 
can be used: 


$$r = 1 - \frac{\sum d^2}{2N\sigma^2}$$


where $r$ = the product-moment correlation between two sorts 

$N$ = the number of items in the sample (100 in the illustrative sample of 
photographs of art objects mentioned in Chapter 17) 

$\sigma$ = the standard deviation of the forced distribution, which is the same 
for all sorts using the distribution 

$d^2$ = the squared difference between the pile number for an item in two 
sorts; thus, if a particular item is placed in pile 7 in one sort and 
pile 9 in another, the $d^2$ is 4. The sum of $d^2$ is taken over the N items 


The term $2N\sigma^2$ is a constant for all correlations computed with a particular 
forced distribution. After the constant is determined, it is only necessary to 
obtain the sum of squared differences between item placements to compute the 
correlation. For any particular Q-sort distribution, a table can be composed to 
transform sums of squared differences directly into correlation coefficients. The 
formula effects a great time saving for the individual who does not have 
high-speed computers available. A correlation can be determined by the formula in a 
small fraction of the time it takes to use the customary product-moment 
formulas. 
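
The shortcut can be sketched in Python as follows. The function name and the small nine-item forced distribution are hypothetical and serve only to illustrate the formula; the sketch assumes that both sorts use exactly the same forced distribution and that the standard deviation is computed with N in the denominator.

```python
def qsort_correlation(sort_a, sort_b):
    """Correlation between two Q-sorts that share one forced distribution.

    sort_a, sort_b: pile numbers assigned to the same items, in the same
    item order. Implements r = 1 - sum(d^2) / (2 * N * sigma^2).
    """
    if len(sort_a) != len(sort_b):
        raise ValueError("both sorts must cover the same items")
    if sorted(sort_a) != sorted(sort_b):
        raise ValueError("both sorts must use the same forced distribution")
    n = len(sort_a)
    mean = sum(sort_a) / n
    variance = sum((pile - mean) ** 2 for pile in sort_a) / n  # sigma squared
    if variance == 0:
        raise ValueError("the forced distribution must have some spread")
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(sort_a, sort_b))
    return 1.0 - sum_d_squared / (2.0 * n * variance)


# Hypothetical nine-item sorts using the forced distribution 1, 2, 2, 3, 3, 3, 4, 4, 5.
first_sort = [1, 2, 2, 3, 3, 3, 4, 4, 5]
second_sort = [2, 1, 3, 2, 3, 4, 3, 5, 4]
print(f"r = {qsort_correlation(first_sort, second_sort):.3f}")
```

For a fixed distribution the quantity `2.0 * n * variance` need be computed only once, which is the constant referred to above.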
APPENDIX 8 


Some Semantic Differential Scales Which Have 
Been Used in Various Studies 


1. angular-rounded 
2. bass-treble 
3. beautiful-ugly 
4. black-white 
5. blatant-muted 
6. boring-interesting 
7. brave-cowardly 
8. bright-dark 
9. calm-agitated 
10. calming-exciting 
11. chaotic-ordered 
12. clean-dirty 
13. clear-hazy 
14. colorful-colorless 
15. concentrated-diffuse 
16. controlled-accidental 
17. deep-shallow 
18. definite-uncertain 
19. deliberate-careless 
20. dull-exciting 
21. dull-sharp 
22. emotional-rational 
23. empty-full 
24. even-uneven 
25. expensive-cheap 
26. fair-unfair 
27. familiar-strange 
28. fast-slow 
29. ferocious-peaceful 
30. formal-informal 
31. fragrant-foul 
32. fresh-stale 
33. gentle-violent 
34. genuine-artificial 
35. gliding-scraping 
36. good-bad 
37. happy-sad 
38. hard-soft 
39. healthy-sick 
40. heavy-light 
41. high-class-low-class 
42. high-low 
43. honest-dishonest 
44. hot-cold 
45. humorous-serious 
46. important-trivial 
47. intimate-remote 
48. kind-cruel 
49. labored-easy 
50. large-small 
51. long-short 
52. loose-tight 
53. loud-soft 
54. lush-austere 
55. masculine-feminine 
56. meaningless-meaningful 
57. mild-intense 
58. modern-old-fashioned 
59. near-far 
60. nice-awful 
61. obvious-subtle 
62. pleasant-unpleasant 
63. pleasing-annoying 
64. powerful-weak 
65. pungent-bland 
66. pushing-pulling 
67. rumbling-whining 
68. red-green 
69. relaxed-tense 
70. repeated-varied 
71. repetitive-varied 
72. resting-busy 
73. rich-poor 
74. rich-thin 
75. rough-smooth 
76. rugged-delicate 
77. sacred-profane 
78. safe-dangerous 
79. simple-complex 
80. sincere-insincere 
81. solid-hollow 
82. static-dynamic 
83. steady-fluttering 
84. strong-weak 
85. superficial-profound 
86. sweet-bitter 
87. sweet-sour 
88. tasty-distasteful 
89. thick-thin 
90. unbelievable-believable 
91. unique-commonplace 
92. usual-unusual 
93. vague-precise 
94. valuable-worthless 
95. vibrant-still 
96. wet-dry 
97. wide-narrow 
98. yellow-blue 
99. young-old 
The Influence of Reliability 
on Predictive Validity 


In Chapter 6, we discussed the correction for attenuation, which estimates 
how highly a test would correlate with an assessment if both were made per- 
fectly reliable. A more useful formula is that for estimating the validity of a 
test if the reliabilities of the test and the assessment are raised or lowered by 
particular amounts. The formula is as follows: 


$$r'_{12} = r_{12} \frac{\sqrt{r'_{11}}\,\sqrt{r'_{22}}}{\sqrt{r_{11}}\,\sqrt{r_{22}}}$$

where $r_{11}$ and $r_{22}$ = the original reliabilities of the predictor test and the 
assessment respectively 
$r'_{11}$ and $r'_{22}$ = the new reliabilities of the predictor test and assessment 
$r_{12}$ = the original correlation between the test and the assessment 
$r'_{12}$ = the estimated correlation between the test and the assessment 
after their respective reliabilities have been increased or decreased 

The formula can be illustrated in a situation in which a test is used to predict 
ratings of performance. Assume that the original reliabilities of the test and the 
assessment are .64 and .49 respectively and that the correlation between the 
test and the assessment is .36. Suppose that, by improving the test and the 
assessment, their reliabilities are raised to .81 and .64 respectively. The 
estimated validity of the test after the reliabilities are increased is obtained as 
follows: 


$$r'_{12} = \frac{.36 \sqrt{.81}\,\sqrt{.64}}{\sqrt{.64}\,\sqrt{.49}} = \frac{.36(.9 \times .8)}{.8 \times .7} = .46$$


The formula will also serve to estimate the validity of a test if the reliabilities of 
the test and the assessment are decreased or if one is increased and the other 
decreased. 